Computational Vexillology

Decoding National Aesthetics Through Data Science

Author

Alejandro Treny

Published

March 18, 2026

Introduction

What if we could treat national flags not as art, but as high-dimensional data? Every pixel encodes a decision: a color chosen, a symbol placed, a geometry defined. Collectively, the ~200 sovereign flags of the world form a rich visual corpus shaped by centuries of history, religion, revolution, and geography.

This project, Computational Vexillology, sets out to answer a provocative question:

Does the design of a country’s flag predict its destiny?

We will convert every flag into a mathematical fingerprint using two complementary lenses:

  • Computer Vision (OpenCV, scikit-image): extracting explicit, interpretable metrics like color warmth, visual entropy, and structural geometry.
  • Deep Learning (ResNet50): extracting latent style embeddings that capture abstract patterns a human might miss.

With these fingerprints in hand, we will:

  1. Map the Design Universe, using UMAP to project flags into a 2D space where visually similar flags cluster together.
  2. Rediscover History, testing whether unsupervised clustering can “accidentally” recover colonial empires, religious blocs, and pan-regional movements.
  3. Test Scientific Hypotheses, correlating flag aesthetics with geography, economics, and politics.

The entire analysis is contained in this document: reproducible code, interactive visualizations, and statistical findings in a single artifact.

0.1 Research Questions

Several hypotheses will guide our exploration. Among them:

  • Solar Determinism: do countries closer to the equator use “hotter” colors (reds, yellows) while northern nations prefer cooler palettes?
  • Complexity of Development: does national wealth correlate with flag simplicity, mirroring the minimalist trend in modern corporate branding?
  • The Revolutionary Diagonal: are diagonal lines and dynamic geometries more common in flags born from revolution or inequality?
  • The Colonial Ghost: can an algorithm, grouping flags purely by visual similarity, rediscover the footprint of the British Empire or the Crescent bloc?

These are starting points, not boundaries. As the data reveals its structure, we will follow wherever it leads.

Import core libraries
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt
import plotly.express as px
import cairosvg
from PIL import Image
from pathlib import Path
from itables import show as itshow
import io
import warnings

warnings.filterwarnings("ignore")

print(f"NumPy:    {np.__version__}")
print(f"Pandas:   {pd.__version__}")
print(f"Pillow:   {Image.__version__}")
print("CairoSVG: ✓")
NumPy:    1.26.4
Pandas:   2.3.3
Pillow:   12.1.1
CairoSVG: ✓

With our environment ready, we begin by assembling the visual corpus.

1 Building the Flag Corpus

The first phase of this project is purely visual. Before introducing any socio-economic or geographic data, we want to let the flags speak for themselves. What patterns emerge when we look at 250 national designs as raw geometry and color?

Our source is FlagCDN, a public CDN that serves every national flag in SVG format, giving us precise vector definitions of colors and shapes rather than lossy rasterized pixels. We pair these with a minimal country index from the REST Countries API, just enough to label each flag with a name and ISO code.

1.1 Country Index

We first build a lightweight index of all countries: just the ISO alpha-2 code, the common name, and independence status. This gives us the list of flags to download and a way to label them. All other metadata (coordinates, inequality, population) will be loaded later when we turn to hypothesis testing.

Build country index from REST Countries API
fields = "name,cca2,independent"
response = requests.get(f"https://restcountries.com/v3.1/all?fields={fields}")
countries_raw = response.json()

df_index = pd.DataFrame([
    {
        "code": c["cca2"].lower(),
        "name": c["name"]["common"],
        "independent": c["independent"],
    }
    for c in countries_raw
]).sort_values("name").reset_index(drop=True)

print(f"{'Total entries:':<25} {len(df_index)}")
print(f"{'Independent states:':<25} {df_index['independent'].sum()}")
print(f"{'Territories/other:':<25} {(~df_index['independent']).sum()}")
itshow(df_index, lengthMenu=[5, 10, 25, 50], pageLength=5)
Total entries:            250
Independent states:       195
Territories/other:        55

The API returns 250 entries: 195 recognized independent states and 55 territories or dependencies. Every entry has a unique two-letter ISO code that will serve as our primary key throughout the analysis.

1.2 Downloading Flags as SVG

We download flags in SVG (Scalable Vector Graphics) rather than PNG. SVGs encode colors as exact hex values and shapes as mathematical paths, which means our color analysis operates on precise definitions rather than compression artifacts. When pixel-level processing is needed later (for neural networks, for instance), we rasterize the SVGs at a controlled resolution using cairosvg.

Download SVG flags from FlagCDN
flag_dir = Path("data/flags_svg")
flag_dir.mkdir(parents=True, exist_ok=True)

success, failed = 0, []
for code in df_index["code"]:
    path = flag_dir / f"{code}.svg"
    if path.exists():
        success += 1
        continue
    try:
        r = requests.get(f"https://flagcdn.com/{code}.svg", timeout=10)
        if r.status_code == 200:
            path.write_bytes(r.content)
            success += 1
        else:
            failed.append(code)
    except Exception:
        failed.append(code)

print(f"SVG flags downloaded: {success} / {len(df_index)}")
if failed:
    print(f"Failed: {failed}")
SVG flags downloaded: 250 / 250

1.3 A First Look

To confirm the pipeline works, let’s rasterize a handful of flags and display them. This also illustrates the diversity of aspect ratios we are dealing with: most flags are 2:3 or 1:2 rectangles, but Nepal’s double-pennant is an entirely different shape.

Preview a sample of downloaded flags
sample_codes = ["de", "es", "br", "np", "za", "gb"]
sample_names = {c: df_index.loc[df_index["code"] == c, "name"].values[0] for c in sample_codes}

fig, axes = plt.subplots(2, 3, figsize=(12, 5))
for ax, code in zip(axes.flat, sample_codes):
    svg_path = flag_dir / f"{code}.svg"
    png_data = cairosvg.svg2png(url=str(svg_path), output_width=640)
    img = Image.open(io.BytesIO(png_data)).convert("RGB")
    ax.set_facecolor("#f0f0f0")
    ax.imshow(np.array(img), aspect="equal")
    ax.set_title(f"{sample_names[code]} ({code.upper()})", fontsize=11)
    ax.axis("off")

plt.tight_layout()
plt.show()

A sample of six flags rasterized from SVG at 640px width. Note the variation in aspect ratios.

All 250 flags are now stored locally as SVGs. In the next section we begin feature extraction: converting each flag into a numerical fingerprint that captures its color palette, visual complexity, and geometric structure.

2 Feature Extraction

A flag is an image. An image is a grid of pixels. To compare flags mathematically, we need to reduce each image to a fixed-length vector of numbers, a fingerprint, where each number captures one meaningful property of the design. The choice of which properties to measure is the most important decision in the entire project, because downstream analyses (distances, clusters, hypothesis tests) can only discover patterns that our features are capable of encoding.

We draw our feature set from three sources:

  • Vexillological design principles, particularly the five rules published by the North American Vexillological Association (NAVA) in Good Flag, Bad Flag: keep it simple, use meaningful symbolism, use two or three basic colors, no lettering or seals, and be distinctive or be related.
  • Flag design taxonomy, as catalogued by Wikipedia’s List of National Flags by Design and Flag Families: the systematic classification of flags by structural elements (stripes, crosses, triangles, crescents, stars) and by historical lineage (Pan-African, Pan-Arab, Pan-Slavic, Nordic Cross, British Ensign, etc.).
  • Computer vision fundamentals: standard image descriptors from information theory (Shannon entropy), edge detection (Canny), and line detection (Hough Transform) that quantify visual properties without any domain-specific assumptions.

The result is a set of 19 features organized into five families. Below we describe each family, the metrics it contains, and the scientific or design rationale behind each one.

2.1 Family 1: Color Palette (8 metrics)

Color is the most immediately visible property of any flag. The heraldic tradition defines a strict vocabulary of tinctures: metals (gold/argent, rendered as yellow and white) and colors (gules/red, azure/blue, vert/green, sable/black, purpure/purple). Nearly every national flag draws its palette from this classical set.

We measure color in the HSV (Hue, Saturation, Value) color space rather than RGB. HSV separates chromatic content (hue) from brightness (value) and intensity (saturation), which makes it much easier to define categories like “red” or “warm” in a way that matches human perception.

The eight color palette metrics are:

Metric | Definition | Why it matters
warmth_score | Fraction of chromatic pixels with warm hues (reds, oranges, yellows) | Directly tests the Solar Determinism hypothesis: do equatorial nations favor hotter colors?
coolness_score | Fraction of chromatic pixels with cool hues (blues, greens) | The complement of warmth. Nordic and maritime nations may cluster here.
red_pct | Fraction of total pixels that are red | Red is the single most common flag color worldwide, associated with blood, revolution, and courage.
blue_pct | Fraction of total pixels that are blue | Blue symbolizes sky, sea, freedom, and vigilance. Common in maritime and democratic traditions.
green_pct | Fraction of total pixels that are green | Green appears in Pan-African, Pan-Arab, and Islamic flag traditions. Also associated with land and agriculture.
yellow_pct | Fraction of total pixels that are yellow/gold | Gold represents wealth, sun, and generosity in heraldic terms. Dominant in African and South American flags.
white_pct | Fraction of total pixels that are white/silver | White symbolizes peace, purity, and snow. Also serves as a background or fimbriation (border) color.
black_pct | Fraction of total pixels that are black | Black represents determination, heritage, and mourning. Prominent in Pan-African and revolutionary flags.

2.2 Family 2: Color Complexity (3 metrics)

NAVA’s third principle states: “Use two or three basic colors from the standard color set.” This is a measurable claim. A flag with two dominant colors is simpler and more recognizable than one with seven. Beyond counting colors, we also want to know how much those colors contrast with each other (high contrast aids recognition at a distance, which is the entire functional purpose of a flag), and whether the palette leans toward the aggressive end of the spectrum.

Metric | Definition | Why it matters
palette_complexity | Number of significant color clusters found by K-Means quantization of the flag’s pixel data. This is not the number of colors a human would name (Afghanistan = 4 to the eye, but 7 at the pixel level due to its detailed emblem). It measures chromatic variety including gradients, shading, and fine artwork. | Operationalizes NAVA’s “2-3 colors” rule at the pixel level. Flags with detailed coats of arms, seals, or multi-shade emblems will score higher than clean geometric designs.
color_contrast | Maximum perceptual color distance (CIEDE2000) between any two dominant color clusters. Values typically range from 0 to 100, though highly chromatic pairs can slightly exceed 100. | High contrast (e.g. black on white, red on green) makes a flag readable from far away. Low contrast suggests a monochromatic or analogous palette.
aggression_index | Combined area fraction of red and black pixels | Tests the Revolutionary Diagonal hypothesis: are flags born from violent independence movements more red-and-black? Also correlates with the Pan-African color tradition.
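As a preview of the implementation to come, here is one minimal sketch of how the two clustering-based metrics could be computed with scikit-learn's `KMeans` and scikit-image's `deltaE_ciede2000`. The function name, the cluster cap `max_k=8`, and the 2% area cutoff for "significant" clusters are illustrative assumptions, not the project's exact parameters:

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.color import rgb2lab, deltaE_ciede2000

def palette_metrics(img_rgb, max_k=8, min_share=0.02):
    """Sketch of palette_complexity and color_contrast (assumed parameters)."""
    pixels = img_rgb.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=max_k, n_init=4, random_state=0).fit(pixels)
    # Keep only clusters that cover a meaningful share of the flag's area
    shares = np.bincount(km.labels_, minlength=max_k) / len(pixels)
    dominant = km.cluster_centers_[shares >= min_share]
    # Max pairwise CIEDE2000 distance between dominant colors, in Lab space
    lab = rgb2lab(dominant.reshape(1, -1, 3) / 255.0)[0]
    contrast = 0.0
    for i in range(len(lab)):
        for j in range(i + 1, len(lab)):
            contrast = max(contrast, float(deltaE_ciede2000(lab[i], lab[j])))
    return {"palette_complexity": int(len(dominant)),
            "color_contrast": round(contrast, 2)}
```

The conversion to Lab space matters: CIEDE2000 is defined over Lab coordinates, where Euclidean-ish distances approximate perceived color difference far better than distances in RGB.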

2.3 Family 3: Visual Complexity (3 metrics)

How “busy” is a flag? A tricolor with three solid blocks of color is among the simplest possible designs. A flag with a detailed coat of arms, animals, text, and ornamental borders is visually complex. NAVA’s first principle (“Keep it simple: the flag should be so simple that a child can draw it from memory”) and fourth principle (“No lettering or seals”) both relate to complexity.

We measure complexity from three complementary angles:

Metric | Definition | Why it matters
visual_entropy | Shannon entropy of the grayscale intensity histogram, in bits | An information-theoretic measure of pixel diversity. Simple flags (few gray levels) have low entropy; intricate designs (many gray levels from gradients, shadows, and detail) have high entropy.
edge_density | Fraction of pixels detected as edges by the Canny algorithm | A geometric measure of complexity. More edges mean more shapes, boundaries, and fine detail in the design. A solid tricolor has very few edges; a flag with a detailed eagle emblem has many.
spatial_entropy | Entropy of color distribution across a spatial grid (the flag divided into a 4x4 grid of cells) | Distinguishes between distributed complexity (patterns spread across the entire flag, like the USA’s stars-and-stripes) and localized complexity (a single emblem on a plain background, like Japan’s circle on white). Two flags can have identical visual_entropy but very different spatial_entropy.

2.4 Family 4: Geometric Structure (4 metrics)

Flag designs fall into well-known structural families: horizontal stripes (tribands, tricolors), vertical stripes, diagonal divisions, crosses, and more. These structural patterns carry historical meaning. Horizontal tricolors descend from the Dutch and French revolutionary traditions. Nordic crosses mark Scandinavian identity. Diagonal stripes are rarer and more dynamic, often signaling a break from colonial templates.

We detect dominant line angles in each flag using the Hough Transform, a classical computer vision algorithm that finds straight lines in an image. By classifying detected lines by their angle, we can quantify whether a flag’s geometry is primarily horizontal, vertical, or diagonal. We also measure bilateral symmetry, since most flags are designed to look the same when reflected horizontally.

Metric | Definition | Why it matters
horizontal_dominance | Fraction of strong Hough lines that are near-horizontal (within 10 degrees of horizontal) | Captures membership in the triband/tricolor family, the single largest design family in the world.
vertical_dominance | Fraction of strong Hough lines that are near-vertical (within 10 degrees of the vertical axis) | Distinguishes vertical tricolors (French tradition: France, Italy, Belgium, Ireland) from horizontal tribands.
diagonal_dominance | Fraction of strong Hough lines that are neither horizontal nor vertical (the middle angular zone between 10 and 80 degrees) | Rare and visually dynamic. Tests the Revolutionary Diagonal hypothesis: flags like Tanzania, Namibia, and the DRC use diagonals.
symmetry_score | Pixel-wise correlation between the flag and its horizontal mirror image | Most flags are designed to be read from both sides. Asymmetric flags (Nepal, Bhutan, flags with off-center emblems like Portugal or Sri Lanka) are structural outliers.
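The angle-classification logic can be prototyped with scikit-image's `hough_line` and `hough_line_peaks`. One subtlety worth noting: in the Hough parametrization, theta is the angle of a line's normal, so a horizontal line registers near |theta| = 90 degrees. The sketch below is illustrative (the Canny sigma, the 10-peak cap, and the exact thresholds are assumptions):

```python
import numpy as np
from skimage.feature import canny
from skimage.transform import hough_line, hough_line_peaks

def compute_geometry(img_rgb):
    """Sketch of the four geometric metrics (illustrative thresholds)."""
    gray = img_rgb.mean(axis=2) / 255.0
    edges = canny(gray, sigma=1.5)
    hspace, theta, dists = hough_line(edges)
    _, angles, _ = hough_line_peaks(hspace, theta, dists, num_peaks=10)
    if len(angles) == 0:
        horiz = vert = diag = 0.0
    else:
        # theta is the angle of the line's NORMAL: a horizontal line has
        # |theta| near 90 degrees, a vertical line has |theta| near 0
        deg = np.abs(np.degrees(angles))
        horiz = float(np.mean(deg >= 80))
        vert = float(np.mean(deg <= 10))
        diag = 1.0 - horiz - vert
    # Bilateral symmetry: correlation between the flag and its mirror image
    flat = img_rgb.astype(float).ravel()
    mirror = img_rgb[:, ::-1].astype(float).ravel()
    sym = float(np.corrcoef(flat, mirror)[0, 1])
    return {"horizontal_dominance": round(horiz, 4),
            "vertical_dominance": round(vert, 4),
            "diagonal_dominance": round(diag, 4),
            "symmetry_score": round(sym, 4)}
```

On a horizontal bicolor, every strong line is horizontal and the mirror image is identical to the original, so horizontal_dominance and symmetry_score both hit 1.0.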

2.5 Family 5: Aspect Ratio (1 metric)

The shape of a flag is one of its most fundamental design decisions, yet it is often overlooked in computational analyses that resize all flags to a fixed square.

Metric | Definition | Why it matters
aspect_ratio | Width divided by height of the rasterized flag | Most flags have a 2:3 ratio (~1.50) or 1:2 (~2.00). Switzerland and Vatican City are square (1.00). Qatar is extremely elongated (~2.55). Nepal is the only flag taller than wide (~0.82). This single number separates entire design traditions.
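Because our rasterizer preserves each SVG's native proportions, this metric reduces to a ratio of the pixel array's dimensions. A trivial sketch (the helper name is ours):

```python
import numpy as np

def compute_aspect_ratio(img_rgb):
    """Width divided by height of a rasterized flag array."""
    h, w = img_rgb.shape[:2]
    return round(w / h, 4)

# A 2:3 flag rasterized at 320 px wide is 213 px tall
print(compute_aspect_ratio(np.zeros((213, 320, 3), dtype=np.uint8)))
```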

2.6 Summary

Altogether, these 19 features span the space of what makes a flag visually distinctive. They are organized so that each family answers a different question:

Family | Question | N
Color Palette | What colors does this flag use? | 8
Color Complexity | How chromatically complex is the palette, and how do its colors contrast? | 3
Visual Complexity | How busy is the design? | 3
Geometric Structure | What shapes and symmetries define its layout? | 4
Aspect Ratio | What is the flag’s shape? | 1
Total | | 19

In the following subsections we implement each family as a Python function, extract all metrics for every flag, and visualize the results one family at a time.

2.7 Rasterization Helper

Before extracting any features, we need a utility function to convert SVG flags into pixel arrays. SVGs are vector graphics (mathematical descriptions of shapes), but our computer vision algorithms operate on rasters (grids of pixels). The function below uses cairosvg to render each SVG at a fixed width and returns a NumPy array in RGB format.

We choose a default width of 320 pixels. This is large enough to preserve fine detail (small stars, thin stripes) but small enough to keep computation fast across 250 flags. The height is determined automatically by the SVG’s native aspect ratio, which means Nepal’s flag will be taller than wide and Qatar’s will be very elongated. This is intentional: we want to preserve the true geometry of each flag rather than distorting it into a fixed square.

SVG to pixel array conversion
import cv2
from scipy.stats import entropy as shannon_entropy
from skimage.feature import canny
from skimage.transform import hough_line, hough_line_peaks

def rasterize_flag(svg_path, width=320):
    """Convert an SVG flag file into a NumPy RGB array.
    
    Parameters
    ----------
    svg_path : str or Path
        Path to the .svg file.
    width : int
        Target width in pixels. Height is computed from the SVG's
        native aspect ratio, so the flag is never distorted.
    
    Returns
    -------
    np.ndarray
        RGB image as a uint8 array of shape (H, W, 3).
    """
    png_data = cairosvg.svg2png(url=str(svg_path), output_width=width)
    img = Image.open(io.BytesIO(png_data)).convert("RGB")
    return np.array(img)

3 Color Palette

We now implement the first metric family. The function below takes an RGB image and returns eight numbers describing its color composition.

How it works, step by step:

  1. Convert RGB to HSV. The HSV color space separates hue (the “name” of the color, like red or blue), saturation (how vivid the color is), and value (how bright it is). This separation lets us define color categories using simple numeric ranges on the hue channel, which would be very awkward in RGB.

  2. Build a chromatic mask. Not every pixel carries meaningful color information. Very dark pixels (low value) look black regardless of their hue, and very pale pixels (low saturation, high value) look white. We exclude these achromatic pixels when computing warmth and coolness scores, so that a flag with a large white area does not dilute its chromatic profile.

  3. Classify hues. OpenCV encodes hue on a 0-179 scale (not 0-360). Red wraps around: hues near 0 and near 179 are both red. We define each color category as a range of hue values combined with minimum saturation and value thresholds to avoid false positives (a very dark, desaturated pixel with hue=120 is not really “green”).

  4. Compute area fractions. Each metric is simply the count of pixels matching a category divided by the total number of pixels (for individual colors) or by the number of chromatic pixels (for warmth/coolness).

Color palette extraction function
def compute_color_palette(img_rgb):
    """Extract 8 color palette metrics from an RGB flag image.
    
    All hue ranges are defined for OpenCV's 0-179 hue scale.
    Saturation and value thresholds prevent false classifications
    in near-black or near-white regions.
    
    Returns a dict with keys:
        warmth_score, coolness_score,
        red_pct, blue_pct, green_pct, yellow_pct, white_pct, black_pct
    """
    # Step 1: convert to HSV color space
    img_hsv = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2HSV)
    h = img_hsv[:, :, 0]  # hue:        0-179
    s = img_hsv[:, :, 1]  # saturation: 0-255
    v = img_hsv[:, :, 2]  # value:      0-255
    
    total_pixels = img_rgb.shape[0] * img_rgb.shape[1]
    
    # Step 2: chromatic mask (exclude near-black and near-white)
    # A pixel is "chromatic" if it has meaningful saturation and is not
    # too dark. Near-white pixels (high V, low S) are excluded by the
    # saturation threshold alone; we do NOT cap V because pure saturated
    # colors like RGB(255,0,0) have V=255 and must be counted.
    chromatic = (s > 25) & (v > 40)
    n_chromatic = max(chromatic.sum(), 1)  # avoid division by zero
    
    # Step 3: classify hues into warm and cool families
    # Warm: reds (wrapping around 0/179), oranges, yellows (H < 30 or H > 160)
    # Cool: greens and blues (H between 35 and 140). We start at 35 rather
    # than higher to capture dark greens (like Pakistan's or Norfolk Island's)
    # whose hue sits around H=70-75 in OpenCV's scale.
    warm_mask = ((h <= 30) | (h >= 160)) & chromatic
    cool_mask = ((h >= 35) & (h <= 140)) & chromatic
    
    # Step 4: individual color masks with tighter thresholds
    # Red wraps around: H <= 10 OR H >= 170, plus strong saturation and brightness
    red_mask   = ((h <= 10) | (h >= 170)) & (s > 80) & (v > 50)
    # Orange occupies a narrow hue band between red and yellow
    # (included in warmth but not tracked separately)
    # Yellow: H 20-35, must be bright and saturated to distinguish from brown
    yellow_mask = ((h >= 20) & (h <= 35)) & (s > 60) & (v > 100)
    # Green: H 35-85, moderate saturation minimum
    green_mask = ((h >= 35) & (h <= 85)) & (s > 40) & (v > 40)
    # Blue: H 85-135, moderate saturation minimum
    blue_mask  = ((h >= 85) & (h <= 135)) & (s > 40) & (v > 40)
    # Black: very low brightness regardless of hue
    black_mask = (v < 40)
    # White: very low saturation AND very high brightness
    white_mask = (s <= 20) & (v >= 230)
    
    return {
        "warmth_score":   round(warm_mask.sum() / n_chromatic, 4),
        "coolness_score":  round(cool_mask.sum() / n_chromatic, 4),
        "red_pct":         round(red_mask.sum() / total_pixels, 4),
        "blue_pct":        round(blue_mask.sum() / total_pixels, 4),
        "green_pct":       round(green_mask.sum() / total_pixels, 4),
        "yellow_pct":      round(yellow_mask.sum() / total_pixels, 4),
        "white_pct":       round(white_mask.sum() / total_pixels, 4),
        "black_pct":       round(black_mask.sum() / total_pixels, 4),
    }

3.1 Extraction

We iterate over every flag in our corpus, rasterize it, and compute the eight color palette metrics. The result is a DataFrame where each row is a flag and each column is a metric.

Run color palette extraction on all flags
records = []
for _, row in df_index.iterrows():
    svg_path = flag_dir / f"{row['code']}.svg"
    if not svg_path.exists():
        continue
    img = rasterize_flag(svg_path)
    metrics = {"code": row["code"], "name": row["name"]}
    metrics.update(compute_color_palette(img))
    records.append(metrics)

df_palette = pd.DataFrame(records)
print(f"Color palette extracted: {df_palette.shape[0]} flags x {df_palette.shape[1]} columns")
itshow(df_palette, lengthMenu=[5, 10, 25, 50], pageLength=5)
Color palette extracted: 250 flags x 10 columns

3.2 Color Landscape Overview

Before looking at individual metrics, let’s get an overview of the entire color landscape. The stacked bar chart below shows the six major color proportions for every flag, sorted from warmest to coolest. Each vertical sliver is one flag; the height of each color band shows how much of the flag’s area that color occupies.

Stacked bar chart of color composition across all flags
color_cols = ["red_pct", "blue_pct", "green_pct", "yellow_pct", "black_pct", "white_pct"]
palette_colors = {
    "red_pct": "#DC143C", "blue_pct": "#4169E1", "green_pct": "#228B22",
    "yellow_pct": "#FFD700", "black_pct": "#2F2F2F", "white_pct": "#D3D3D3"
}

df_sorted = df_palette.sort_values("warmth_score", ascending=False).reset_index(drop=True)

fig, ax = plt.subplots(figsize=(14, 5))
bottom = np.zeros(len(df_sorted))
for col in color_cols:
    ax.bar(range(len(df_sorted)), df_sorted[col], bottom=bottom,
           color=palette_colors[col], label=col.replace("_pct", "").title(), width=1.0)
    bottom += df_sorted[col].values

ax.set_xlabel("Flags (sorted by warmth score, warmest on the left)")
ax.set_ylabel("Proportion of flag area")
ax.set_title("Color Composition of 250 National Flags")
ax.legend(loc="upper right", framealpha=0.9)
ax.set_xlim(-0.5, len(df_sorted) - 0.5)
ax.set_ylim(0, 1)
ax.set_xticks([])
plt.tight_layout()
plt.show()

Color composition of all 250 flags, sorted by warmth score. Each vertical bar is one flag. The six bands show the proportion of the flag’s area occupied by each major color.

3.3 Summary Statistics

The table below shows the distribution of each color palette metric across all 250 flags. Pay attention to the means and the spread (std): they tell us which colors dominate the world’s flags on average and how much variation exists.

Descriptive statistics for all 8 color palette metrics
palette_metrics = ["warmth_score", "coolness_score", "red_pct", "blue_pct",
                   "green_pct", "yellow_pct", "white_pct", "black_pct"]
df_palette[palette_metrics].describe().round(4)
warmth_score coolness_score red_pct blue_pct green_pct yellow_pct white_pct black_pct
count 250.0000 250.0000 250.0000 250.0000 250.0000 250.0000 250.0000 250.0000
mean 0.5152 0.4842 0.2881 0.2478 0.1358 0.0888 0.1792 0.0463
std 0.3233 0.3231 0.2474 0.2865 0.2076 0.1423 0.1926 0.1096
min 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
25% 0.2084 0.2673 0.0764 0.0000 0.0000 0.0000 0.0049 0.0000
50% 0.5029 0.4959 0.2981 0.1612 0.0018 0.0065 0.0985 0.0000
75% 0.7232 0.7912 0.3990 0.3947 0.2992 0.1183 0.3217 0.0016
max 1.0000 1.0000 0.9742 0.9770 0.9594 0.7588 0.8623 0.4871

3.4 Warmth vs Coolness

The warmth and coolness scores partition the chromatic content of each flag into two opposing camps. Since a pixel can be warm, cool, or neither (e.g., purple, which sits between red and blue), these two scores do not necessarily sum to 1. The scatter plot below shows each flag as a point in warmth-coolness space, with its dominant color indicated by marker color.

Interactive scatter plot of warmth vs coolness for every flag
# Determine the dominant color for each flag (for marker coloring)
dom_cols = ["red_pct", "blue_pct", "green_pct", "yellow_pct", "white_pct", "black_pct"]
dom_labels = {
    "red_pct": "Red", "blue_pct": "Blue", "green_pct": "Green",
    "yellow_pct": "Yellow", "white_pct": "White", "black_pct": "Black"
}
dom_color_scale = {
    "Red": "#DC143C", "Blue": "#4169E1", "Green": "#228B22",
    "Yellow": "#FFD700", "White": "#999999", "Black": "#2F2F2F"
}
df_warmth_plot = df_palette.copy()
df_warmth_plot["dominant_color"] = df_palette[dom_cols].idxmax(axis=1).map(dom_labels)

fig = px.scatter(
    df_warmth_plot, x="warmth_score", y="coolness_score",
    color="dominant_color",
    color_discrete_map=dom_color_scale,
    hover_name="name",
    hover_data={"warmth_score": ":.2f", "coolness_score": ":.2f", "dominant_color": True},
    labels={"warmth_score": "Warmth Score", "coolness_score": "Coolness Score",
            "dominant_color": "Dominant Color"},
    title="Flag Color Temperature: Warmth vs Coolness",
    opacity=0.75, width=800, height=600,
)

# Reference line: warmth + coolness = 1 (fully chromatic flags lie on it)
fig.add_shape(type="line", x0=0, y0=1, x1=1, y1=0,
              line=dict(color="#ccc", width=1, dash="dash"))

fig.update_layout(xaxis_range=[-0.05, 1.05], yaxis_range=[-0.05, 1.05])
fig.show()

Each dot is a flag. Flags in the bottom-right are dominated by warm colors (reds, oranges, yellows). Flags in the top-left are cool (blues, greens). Flags near the origin have mostly achromatic palettes (white, black).

3.5 Which Colors Dominate?

The bar chart below aggregates: for each of the six major colors, what is the average proportion across all 250 flags? This tells us the “global average flag” color recipe.

Average color prevalence across all 250 flags
mean_colors = df_palette[color_cols].mean().sort_values(ascending=True)
color_labels = [c.replace("_pct", "").title() for c in mean_colors.index]
bar_colors = [palette_colors[c] for c in mean_colors.index]

fig, ax = plt.subplots(figsize=(8, 4.5))
bars = ax.barh(color_labels, mean_colors.values, color=bar_colors, edgecolor="white", height=0.6)

# Add percentage labels on each bar
for bar, val in zip(bars, mean_colors.values):
    ax.text(bar.get_width() + 0.005, bar.get_y() + bar.get_height() / 2,
            f"{val:.1%}", va="center", fontsize=10)

ax.set_xlabel("Mean proportion of flag area")
ax.set_title("Average Color Prevalence Across 250 National Flags")
ax.set_xlim(0, mean_colors.max() * 1.25)
plt.tight_layout()
plt.show()

Mean proportion of each color across all 250 flags. Red and white dominate, followed by blue. Yellow and green are less common, and black is the rarest major color.

3.6 The Warmest and Coolest Flags

Finally, let’s look at the actual flags sitting at the extremes. Which flags are the most dominated by warm colors? Which are the coolest? And which flags have almost no chromatic content at all (dominated by white and black)?

Display the warmest, coolest, and most achromatic flags
# Achromatic dominance: how much of the flag is white + black (non-chromatic)
df_palette["achromatic_pct"] = df_palette["white_pct"] + df_palette["black_pct"]

groups = [
    ("warmth_score", True, "Warmest Flags"),
    ("coolness_score", True, "Coolest Flags"),
    ("achromatic_pct", True, "Most Achromatic Flags"),
]

fig, axes = plt.subplots(3, 5, figsize=(14, 8))
for row_idx, (metric, largest, title) in enumerate(groups):
    subset = df_palette.nlargest(5, metric) if largest else df_palette.nsmallest(5, metric)
    for col_idx, (_, flag_row) in enumerate(subset.iterrows()):
        ax = axes[row_idx, col_idx]
        img = rasterize_flag(flag_dir / f"{flag_row['code']}.svg", width=320)
        ax.set_facecolor("#f0f0f0")
        ax.imshow(img, aspect="equal")
        score = flag_row[metric]
        ax.set_title(f"{flag_row['name']}\n{metric}: {score:.3f}", fontsize=8)
        ax.axis("off")
    # axis("off") hides axis labels, so draw the row title as free text instead
    axes[row_idx, 0].text(-0.12, 0.5, title, transform=axes[row_idx, 0].transAxes,
                          rotation=90, va="center", fontsize=9)

plt.suptitle("Color Palette Extremes", fontsize=13, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

Top row: the 5 flags with the highest warmth scores. Middle row: the 5 flags with the highest coolness scores. Bottom row: the 5 flags with the highest combined white + black area (most achromatic).

3.7 Discussion

Several patterns stand out from the data.

Red is the world’s flag color. With a mean of 28.8% of flag area, red leads every other color. 39 flags dedicate more than half their area to red, led by China (97.4%), Morocco (97.0%), and Turkey (94.1%). This is consistent with heraldic tradition, where gules (red) is the most popular tincture, and with the cross-cultural associations of red: blood, courage, revolution, and sacrifice.

Blue is close behind, but distributed differently. Blue averages 24.8%, nearly tied with red, but it behaves differently. 108 flags (43% of the corpus) contain essentially zero blue, while 49 flags are more than half blue. Blue appears in an all-or-nothing pattern: when a flag uses blue, it tends to use a lot of it. The blue-dominant list reads like a map of the Pacific Ocean (Micronesia, Palau, Nauru, Australia, New Zealand) plus the British Ensign family (flags with Union Jacks on blue fields).

White is a supporting color, not a leading one. At 17.9%, white is the third most common color on average, but only 15 flags are majority-white. The white-dominant flags are revealing: Cyprus, Japan, South Korea, Israel, and Georgia are all flags with a simple symbol on a plain white field. White functions less as a “color” and more as negative space.

Green clusters in specific traditions. Green averages 13.6%, and 145 flags (58%) use essentially no green at all. But the green-dominant flags tell a clear story: Saudi Arabia (95.9%), Turkmenistan (83.7%), Bangladesh (79.0%), and Pakistan (68.8%) all belong to the Islamic design tradition. Brazil (69.1%) and Nigeria (66.9%) represent the Pan-African and Latin American strands respectively. Green is the most “culturally loaded” color in the corpus.

Yellow and black are rare and specialized. Yellow averages only 8.9%, and only 4 flags dedicate more than half their area to it. Black averages 4.6%, and no flag in the world is majority-black (Libya comes closest at 48.7%, followed by Papua New Guinea at 48.0%). Black appears most prominently in the Pan-African tricolor tradition (black-red-green or black-yellow-green) and in European tribands (Germany, Belgium).

The warm-cool balance is remarkably even. Mean warmth (0.515) and mean coolness (0.484) are nearly equal, suggesting that the world’s flags, taken as a whole, are chromatically balanced. However, the distribution is bimodal rather than normal: 45 flags are purely warm (warmth > 0.95) and 22 are purely cool (coolness > 0.95), while 171 flags (68%) mix both warm and cool hues. Very few flags are chromatically neutral.

The Solar Determinism hypothesis has an early lead. The purely warm flags include many equatorial and tropical nations (Vietnam, Turkey, Morocco, China, Indonesia, Kyrgyzstan), while the purely cool flags include Scandinavian (Finland, Iceland), maritime (Micronesia, Palau, Nauru), and temperate nations (Estonia, Greece, Israel). This is suggestive, but not yet conclusive. We will need geographic coordinates and a proper statistical test to evaluate this hypothesis rigorously in the hypothesis testing section.

A note on coverage. Our six named color categories (red, blue, green, yellow, white, black) collectively account for 98.6% of all pixels across 250 flags. The remaining 1.4% falls into transitional regions that no single category claims: orange (H 11-19 in OpenCV, straddling red and yellow), muted tones produced by anti-aliasing at stripe boundaries, grays that are too saturated for the white mask but too desaturated for any chromatic mask, and the occasional purple. Ireland and Ivory Coast are the most affected: their orange stripes occupy a third of the flag area and are not counted by any individual color percentage. Crucially, these pixels are not lost to the analysis: the warmth_score and coolness_score metrics cover the full chromatic range, as do the K-Means-based metrics in Family 2. The gap is confined to the six named-color breakdowns, which are intentionally strict to avoid false positives.

With the color palette extracted and explored, we have a clear picture of what colors each flag uses and how the world’s flags distribute across the warm-cool spectrum. In the next section, we move to Color Complexity: not just which colors a flag uses, but how many and how contrastingly.

4 Color Complexity

Family 1 told us which colors appear in each flag. Family 2 asks a different question: how chromatically complex is the flag’s palette, and how much do its colors contrast with each other?

This family operationalizes two of NAVA’s five principles. Principle 3 says “Use two or three basic colors from the standard color set.” We can now test that rule quantitatively: do most flags actually use 2-3 colors, or is there a long tail of complex palettes? Principle 1 says “Keep it simple”, and a flag with high color contrast between its dominant blocks is visually simpler (more readable at a distance) than one where colors blur together.

We also introduce the aggression index, a compound metric that combines the area of red and black pixels. These two colors carry the heaviest symbolic weight in flag design: red for blood and revolution, black for mourning, resistance, and Pan-African identity. Their combined area gives us a single number to test the Revolutionary Diagonal hypothesis later.

How it works, step by step:

  1. Color quantization with K-Means. We reduce each flag’s millions of possible RGB values to a small set of representative colors by running K-Means clustering with k=8 in RGB space. After clustering, we discard clusters that represent less than 1.5% of the total area (noise, anti-aliasing artifacts). We chose 1.5% rather than a higher threshold because some flags have small but important symbols: China’s yellow stars, for instance, occupy only about 2.5% of the flag area. The number of surviving clusters is palette_complexity. Note that this is not the number of colors a human would name when looking at the flag. A human sees Afghanistan as a 4-color flag (black, red, green, white), but the emblem’s fine artwork introduces brown, gold, and intermediate shades that push the pixel-level count to 7. This is intentional: palette_complexity measures how chromatically varied the design actually is, not how many colors appear in the official specification.

  2. Perceptual color contrast. For each pair of dominant color clusters, we convert from RGB to the CIELAB color space and compute the CIEDE2000 color difference (\(\Delta E_{00}\)). Unlike luminance-only measures (such as the WCAG contrast ratio), CIEDE2000 accounts for differences in hue and saturation as well as lightness, so two colors that are equally dark but very different in hue, such as red and green, correctly score as highly contrastive. We report the maximum \(\Delta E_{00}\) across all pairs of dominant colors. Values typically range from 0 (identical colors) to approximately 100 (black vs white), though maximally different chromatic pairs can slightly exceed 100.

  3. Aggression index. Simply the sum of red_pct and black_pct from Family 1. We compute it here rather than deriving it later so that Family 2’s DataFrame is self-contained.

Color complexity extraction function
from sklearn.cluster import MiniBatchKMeans
from skimage.color import rgb2lab, deltaE_ciede2000

def compute_color_complexity(img_rgb, red_pct, black_pct):
    """Extract 3 color complexity metrics from an RGB flag image.
    
    Parameters
    ----------
    img_rgb : np.ndarray
        RGB image array of shape (H, W, 3).
    red_pct : float
        Pre-computed red area fraction from Family 1 (avoids recomputation).
    black_pct : float
        Pre-computed black area fraction from Family 1.
    
    Returns
    -------
    dict with keys: palette_complexity, color_contrast, aggression_index
    """
    # Step 1: flatten pixels and run K-Means with k=8
    pixels = img_rgb.reshape(-1, 3).astype(np.float64)
    kmeans = MiniBatchKMeans(n_clusters=8, random_state=42, n_init=3, batch_size=1024)
    labels = kmeans.fit_predict(pixels)
    centers = kmeans.cluster_centers_  # shape (8, 3)
    
    # Compute the proportion of pixels in each cluster
    total = len(labels)
    proportions = np.array([(labels == i).sum() / total for i in range(8)])
    
    # Step 2: keep only clusters above the 1.5% noise threshold.
    # We use 1.5% rather than a higher cutoff because some flags have
    # small but visually important symbols (e.g., China's yellow stars
    # occupy ~2.5% of the flag area, and Micronesia's white stars ~2.4%).
    # A threshold of 3% would erase these, collapsing the flag to 1 color.
    significant = proportions >= 0.015
    n_distinct = int(significant.sum())
    sig_centers = centers[significant]
    sig_proportions = proportions[significant]
    
    # Step 3: compute maximum perceptual color distance (CIEDE2000)
    # We convert each dominant color to CIELAB and compute delta-E between all
    # pairs. Unlike WCAG luminance contrast, CIEDE2000 captures differences in
    # hue and saturation as well as lightness, so red vs green (both dark)
    # correctly scores as highly contrastive.
    max_contrast = 0.0
    if len(sig_centers) >= 2:
        # rgb2lab expects (H, W, 3) float64 in [0, 1]
        lab_colors = rgb2lab(sig_centers.reshape(1, -1, 3) / 255.0)[0]  # shape (N, 3)
        for i in range(len(lab_colors)):
            for j in range(i + 1, len(lab_colors)):
                de = deltaE_ciede2000(
                    lab_colors[i].reshape(1, 1, 3),
                    lab_colors[j].reshape(1, 1, 3),
                )[0, 0]
                if de > max_contrast:
                    max_contrast = de
    
    # Step 4: aggression index = red + black area from Family 1
    aggression = round(red_pct + black_pct, 4)
    
    return {
        "palette_complexity": n_distinct,
        "color_contrast": round(max_contrast, 2),
        "aggression_index": aggression,
    }
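As an aside, the O(N²) pairwise loop in Step 3 can be collapsed into a single call. skimage's deltaE_ciede2000 operates elementwise on channel-last arrays, so standard NumPy broadcasting should apply; this is a sketch under that assumption, not a drop-in replacement for the function above.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def max_pairwise_deltaE(rgb_colors: np.ndarray) -> float:
    """Maximum CIEDE2000 distance over all pairs of 0-255 RGB colors."""
    lab = rgb2lab(rgb_colors.reshape(1, -1, 3) / 255.0)[0]   # (N, 3)
    # (N, 1, 3) against (1, N, 3) broadcasts to the full (N, N) matrix.
    de = deltaE_ciede2000(lab[:, None, :], lab[None, :, :])
    return float(de.max())

colors = np.array([[255, 255, 255], [0, 0, 0], [200, 16, 46]], dtype=np.float64)
print(f"max deltaE00: {max_pairwise_deltaE(colors):.1f}")  # black vs white: 100.0
```

With at most 8 dominant colors the explicit loop is already fast; the broadcasted form mainly helps readability if the palette grows.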

4.1 Extraction

We now run the extraction loop. Since the aggression index reuses red_pct and black_pct from Family 1, we pull those values from df_palette rather than recomputing them.

Run color complexity extraction on all flags
records_complexity = []
for _, row in df_palette.iterrows():
    svg_path = flag_dir / f"{row['code']}.svg"
    if not svg_path.exists():
        continue
    img = rasterize_flag(svg_path)
    metrics = {"code": row["code"], "name": row["name"]}
    metrics.update(compute_color_complexity(img, row["red_pct"], row["black_pct"]))
    records_complexity.append(metrics)

df_complexity = pd.DataFrame(records_complexity)
print(f"Color complexity extracted: {df_complexity.shape[0]} flags x {df_complexity.shape[1]} columns")
itshow(df_complexity, lengthMenu=[5, 10, 25, 50], pageLength=5)
Color complexity extracted: 250 flags x 5 columns

4.2 Palette Complexity

NAVA’s third principle recommends 2-3 colors. Our palette_complexity metric captures something broader: the number of distinct color clusters at the pixel level, including shading and emblem detail. Let’s see how the world’s flags distribute on this scale.

Distribution of palette complexity across all flags
fig, ax = plt.subplots(figsize=(9, 5))

counts = df_complexity["palette_complexity"].value_counts().sort_index()
bars = ax.bar(counts.index, counts.values, color="#4C72B0", edgecolor="white", width=0.7)

# Highlight NAVA's recommended zone (2-3 colors)
ax.axvspan(1.65, 3.35, alpha=0.12, color="#2ca02c", label="NAVA recommended (2-3)")
ax.axvline(3, color="#2ca02c", linestyle="--", linewidth=1, alpha=0.5)

# Label each bar with its count
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, height + 1,
            str(int(height)), ha="center", fontsize=10)

ax.set_xlabel("Palette Complexity (number of pixel-level color clusters)")
ax.set_ylabel("Number of Flags")
ax.set_title("Palette Complexity of National Flags")
ax.legend(loc="upper right")
ax.set_xticks(range(1, 9))
plt.tight_layout()
plt.show()

Distribution of palette complexity (number of significant pixel-level color clusters) across all 250 flags. The shaded green band marks NAVA’s recommended range of 2-3 colors. Flags with detailed emblems or coats of arms push the count above 5.
NAVA compliance statistics
n = len(df_complexity)
n_nava = ((df_complexity["palette_complexity"] >= 2) & (df_complexity["palette_complexity"] <= 3)).sum()
n_under = (df_complexity["palette_complexity"] < 2).sum()
n_over = (df_complexity["palette_complexity"] > 3).sum()
median_colors = df_complexity["palette_complexity"].median()
mean_colors = df_complexity["palette_complexity"].mean()

print(f"NAVA-compliant (2-3 colors): {n_nava} flags ({n_nava/n:.0%})")
print(f"Fewer than 2 colors:         {n_under} flags")
print(f"More than 3 colors:          {n_over} flags ({n_over/n:.0%})")
print(f"Median colors:               {median_colors:.0f}")
print(f"Mean colors:                 {mean_colors:.1f}")
NAVA-compliant (2-3 colors): 143 flags (57%)
Fewer than 2 colors:         0 flags
More than 3 colors:          107 flags (43%)
Median colors:               3
Mean colors:                 3.6

The numbers tell us something interesting: while NAVA recommends 2-3 colors, palette_complexity often exceeds that range. This does not mean most flags are badly designed. It reflects the gap between official color specifications and pixel-level reality. Afghanistan officially has 4 colors, but its emblem renders as 7 pixel clusters. A clean tricolor like France scores exactly 3. The metric correctly separates geometrically simple designs from artistically detailed ones, which is precisely the dimension we want to capture.

4.3 Color Contrast

How different are a flag’s dominant colors from each other? High contrast (e.g., white on black, red on green, blue on yellow) makes a flag readable from far away, which is the original functional purpose of a flag: identification at a distance on a battlefield or a ship. Low contrast suggests a monochromatic or analogous palette that prioritizes subtlety over raw visibility.

We measure contrast using CIEDE2000 (\(\Delta E_{00}\)), the international standard for perceptual color difference. Unlike luminance-only measures (such as the WCAG contrast ratio, which only captures lightness differences), CIEDE2000 operates in the CIELAB color space and accounts for hue and saturation as well as lightness. This means red vs dark green, two colors with nearly identical luminance but very different hues, correctly scores as highly contrastive.
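A quick numeric check makes the point. The RGB triples below are illustrative stand-ins (not official flag colors), and `wcag_ratio` is a hand-rolled helper following the WCAG relative-luminance formula; neither is part of the extraction pipeline.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

def de00(rgb1, rgb2):
    """CIEDE2000 difference between two sRGB colors given as 0-255 triples."""
    pair = np.array([rgb1, rgb2], dtype=np.float64).reshape(1, 2, 3) / 255.0
    lab = rgb2lab(pair)[0]
    return float(deltaE_ciede2000(lab[0][None, None, :], lab[1][None, None, :])[0, 0])

def wcag_ratio(rgb1, rgb2):
    """WCAG 2.x contrast ratio, which only sees relative luminance."""
    def lum(rgb):
        c = np.asarray(rgb, dtype=np.float64) / 255.0
        lin = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
        return float(lin @ np.array([0.2126, 0.7152, 0.0722]))
    hi, lo = sorted((lum(rgb1), lum(rgb2)), reverse=True)
    return (hi + 0.05) / (lo + 0.05)

red, dark_green = (204, 0, 0), (0, 102, 0)
print(f"CIEDE2000: {de00(red, dark_green):.1f}")          # well above the 'strong' mark of 50
print(f"WCAG:      {wcag_ratio(red, dark_green):.2f}:1")  # close to 1:1 -- nearly equal luminance
```

The two scores disagree sharply for this pair: the luminance-only ratio sees almost nothing, while CIEDE2000 registers the large hue difference.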

Distribution of maximum perceptual color contrast
fig, ax = plt.subplots(figsize=(9, 5))

ax.hist(df_complexity["color_contrast"], bins=25, color="#E07B39",
        edgecolor="white", alpha=0.85)

# Perceptual distance reference thresholds
ax.axvline(25, color="#cc7700", linestyle="--", linewidth=1.2, alpha=0.7, label="Low contrast (<25)")
ax.axvline(50, color="#2ca02c", linestyle="--", linewidth=1.2, alpha=0.7, label="Strong contrast (50)")
ax.axvline(75, color="#1a6600", linestyle="--", linewidth=1.2, alpha=0.7, label="Very strong contrast (75)")

ax.set_xlabel(r"Maximum Color Contrast (CIEDE2000 $\Delta E_{00}$)")
ax.set_ylabel("Number of Flags")
ax.set_title("Perceptual Color Contrast in National Flags")
ax.legend(loc="upper left")
plt.tight_layout()
plt.show()

Distribution of the maximum CIEDE2000 color difference between any two dominant colors. A value of 0 means all dominant colors are perceptually identical. Pure black vs white scores approximately 100, though maximally different chromatic pairs (like deep blue vs vivid yellow) can slightly exceed that. Values above 50 indicate strongly distinct palettes.
Perceptual contrast statistics
high_contrast = (df_complexity["color_contrast"] >= 75).sum()
strong_contrast = ((df_complexity["color_contrast"] >= 50) & (df_complexity["color_contrast"] < 75)).sum()
moderate_contrast = ((df_complexity["color_contrast"] >= 25) & (df_complexity["color_contrast"] < 50)).sum()
low_contrast = (df_complexity["color_contrast"] < 25).sum()

print(f"Very strong contrast (>= 75):  {high_contrast} flags ({high_contrast/n:.0%})")
print(f"Strong contrast (50-75):       {strong_contrast} flags ({strong_contrast/n:.0%})")
print(f"Moderate contrast (25-50):     {moderate_contrast} flags ({moderate_contrast/n:.0%})")
print(f"Low contrast (< 25):           {low_contrast} flags ({low_contrast/n:.0%})")
print(f"Mean color contrast:           {df_complexity['color_contrast'].mean():.1f}")
print(f"Median color contrast:         {df_complexity['color_contrast'].median():.1f}")
Very strong contrast (>= 75):  118 flags (47%)
Strong contrast (50-75):       105 flags (42%)
Moderate contrast (25-50):     26 flags (10%)
Low contrast (< 25):           1 flags (0%)
Mean color contrast:           73.1
Median color contrast:         73.0

Flag designers are instinctive contrast engineers. The distribution reveals that most national flags achieve strong perceptual separation between their dominant colors, which makes sense: a flag that cannot be distinguished at 200 meters fails its primary purpose.

The few flags with low contrast deserve individual attention. Let’s see which they are:

Flags with the lowest perceptual color contrast
low_df = df_complexity.nsmallest(8, "color_contrast")

fig, axes = plt.subplots(1, 8, figsize=(16, 2.5))
for ax, (_, row) in zip(axes, low_df.iterrows()):
    img = rasterize_flag(flag_dir / f"{row['code']}.svg", width=320)
    ax.set_facecolor("#f0f0f0")
    ax.imshow(img, aspect="equal")
    ax.set_title(f"{row['name']}\n$\\Delta E$={row['color_contrast']:.1f}", fontsize=8)
    ax.axis("off")

plt.suptitle("Lowest Contrast Flags", fontsize=11, fontweight="bold")
plt.tight_layout()
plt.show()

The 8 flags with the lowest CIEDE2000 color contrast. These designs use colors that are perceptually close to each other, whether through similar hues, similar lightness, or both.

4.4 The Aggression Index

The aggression index combines the two most symbolically intense colors in flag design: red (blood, revolution, sacrifice) and black (mourning, resistance, heritage). A high aggression index does not literally mean the country is aggressive. It captures a specific aesthetic posture: the visual weight of colors historically associated with struggle and defiance.

Distribution of the aggression index (red + black area)
fig, ax = plt.subplots(figsize=(9, 5))

ax.hist(df_complexity["aggression_index"], bins=30, color="#C44E52",
        edgecolor="white", alpha=0.85)

ax.axvline(df_complexity["aggression_index"].mean(), color="#333",
           linestyle="--", linewidth=1.2, label=f"Mean: {df_complexity['aggression_index'].mean():.2f}")
ax.axvline(df_complexity["aggression_index"].median(), color="#666",
           linestyle=":", linewidth=1.2, label=f"Median: {df_complexity['aggression_index'].median():.2f}")

ax.set_xlabel("Aggression Index (red_pct + black_pct)")
ax.set_ylabel("Number of Flags")
ax.set_title("The Aggression Index: Red + Black Dominance")
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()

Distribution of the aggression index across 250 flags. The index ranges from 0 (no red or black at all) to nearly 1 (almost entirely red and black). The bimodal shape suggests two design populations: flags that avoid red/black entirely, and flags that lean heavily into them.

Which flags sit at the extremes? The strip below shows the 8 most aggressive and 8 most peaceful designs:

Most and least aggressive flags by the aggression index
fig, axes = plt.subplots(2, 8, figsize=(18, 4.5))

for row_idx, (subset, title) in enumerate([
    (df_complexity.nlargest(8, "aggression_index"), "Most Aggressive"),
    (df_complexity.nsmallest(8, "aggression_index"), "Most Peaceful"),
]):
    for col_idx, (_, flag_row) in enumerate(subset.iterrows()):
        ax = axes[row_idx, col_idx]
        img = rasterize_flag(flag_dir / f"{flag_row['code']}.svg", width=320)
        ax.set_facecolor("#f0f0f0")
        ax.imshow(img, aspect="equal")
        ax.set_title(f"{flag_row['name']}\n{flag_row['aggression_index']:.2f}", fontsize=7)
        ax.axis("off")
    axes[row_idx, 0].text(-0.05, 0.5, title, transform=axes[row_idx, 0].transAxes,
                          fontsize=9, ha="right", va="center")  # set_ylabel would be hidden by axis("off")

plt.suptitle("Aggression Index Extremes", fontsize=12, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

Top row: the 8 flags with the highest aggression index (most red + black). Bottom row: the 8 flags with the lowest aggression index (least red and black). The aggressive row reads like a list of revolution and resistance; the peaceful row is dominated by blue, green, and yellow palettes.

4.5 Complexity vs Contrast

Do flags with more colors also tend to have higher contrast, or is there a trade-off? The scatter plot below maps each flag in the space of palette complexity vs perceptual contrast (CIEDE2000 \(\Delta E_{00}\)), with the aggression index encoded as marker color (red = high aggression, blue = low). This gives us a three-dimensional view of color complexity in a single chart.

Interactive scatter: color count vs contrast, colored by aggression
fig = px.scatter(
    df_complexity, x="palette_complexity", y="color_contrast",
    color="aggression_index",
    color_continuous_scale="RdYlBu_r",
    range_color=[0, 1],
    hover_name="name",
    hover_data={"palette_complexity": True, "color_contrast": ":.1f",
                "aggression_index": ":.2f"},
    labels={"palette_complexity": "Palette Complexity",
            "color_contrast": "Max Color Contrast (CIEDE2000 ΔE₀₀)",
            "aggression_index": "Aggression Index"},
    title="Palette Complexity vs Color Contrast",
    opacity=0.75, width=800, height=600,
)

fig.update_layout(xaxis=dict(dtick=1))
fig.show()

Each dot is a flag. X-axis: palette complexity (pixel-level color clusters). Y-axis: maximum perceptual color contrast. Color: aggression index (red = high, blue = low). Flags in the upper-right are both chromatically complex and high-contrast. Flags in the lower-left are simple and low-contrast.

4.6 Warmth Meets Aggression

Before moving on, let’s bridge Family 1 and Family 2 with one final visualization. Is there a relationship between a flag’s color temperature (warmth score) and its aggression (red + black area)? Intuitively there should be: warm flags tend to be red, and red is a major component of the aggression index. But the relationship is not guaranteed to be linear, since a flag can be warm through yellow/orange rather than red, and aggression also includes black.

Interactive scatter connecting Family 1 (warmth) to Family 2 (aggression)
# Merge Family 1 and Family 2 for cross-referencing
df_cross = df_palette[["code", "name", "warmth_score"]].merge(
    df_complexity[["code", "aggression_index", "color_contrast"]], on="code"
)

corr = df_cross["warmth_score"].corr(df_cross["aggression_index"])

fig = px.scatter(
    df_cross, x="warmth_score", y="aggression_index",
    color="color_contrast",
    color_continuous_scale="viridis",
    hover_name="name",
    hover_data={"warmth_score": ":.2f", "aggression_index": ":.2f",
                "color_contrast": ":.1f"},
    labels={"warmth_score": "Warmth Score (Family 1)",
            "aggression_index": "Aggression Index (Family 2)",
            "color_contrast": "Color Contrast (ΔE₀₀)"},
    title=f"Color Temperature vs Aggression (Pearson r = {corr:.3f})",
    opacity=0.75, width=800, height=600,
)

fig.update_layout(xaxis_range=[-0.05, 1.05], yaxis_range=[-0.05, 1.05])
fig.show()

Each dot is a flag. X-axis: warmth score (Family 1). Y-axis: aggression index (Family 2). The positive correlation confirms that warm flags tend to be aggressive, but the scatter reveals many exceptions: warm-but-peaceful flags (orange/yellow dominance) and cool-but-aggressive flags (flags that combine blue with significant black areas).

4.7 Discussion

Several findings emerge from Family 2.

Palette complexity captures more than the “official” color count. A human looking at Afghanistan’s flag sees 4 colors (black, red, green, white). Wikipedia’s specification lists 6 (counting two shades of red and two shades of green in the emblem). Our metric finds 7, because the emblem’s fine artwork (mosque, wheat wreath, Arabic script) introduces browns, golds, and intermediate shades at the pixel level. This gap between human perception, official specification, and pixel reality is the point. We deliberately named this metric palette_complexity rather than “number of colors” to signal that it measures chromatic variety in the rendered image, not the count a person would give. Flags with clean geometric designs (tricolors, bicolors) score 2-3. Flags with detailed coats of arms or multi-shade emblems score 5-7. This is exactly the dimension that NAVA’s simplicity principle targets: a flag that a child can draw from memory will have low palette complexity, while one requiring an artist will score high.

Flag designers are master contrast engineers. When measured with CIEDE2000 (which captures hue and saturation differences, not just lightness), most flags show strong perceptual separation between their dominant colors. Red-and-green flags like Bangladesh and Maldives, which would score near 1:1 on a luminance-only scale, correctly register as highly contrastive here because red and green sit on opposite sides of the CIELAB color space. Centuries before color science formalized these distinctions, flag designers were already exploiting the full perceptual gamut to maximize visibility at a distance.

The aggression index is bimodal. Flags cluster into two groups: those that avoid red and black (many Islamic, Pacific, and blue-tradition flags), and those that lean into them (Pan-African, revolutionary, and European flags). The bimodal distribution is a first hint that flags do not occupy a single design continuum but fall into distinct stylistic traditions.

Warmth and aggression are correlated but not identical. Warm flags tend to be aggressive (high red content drives both metrics), but the scatter reveals a meaningful population of exceptions. Warm-but-peaceful flags use orange and yellow instead of red (e.g., some Asian flags with gold). Cool-but-aggressive flags combine blue with black (e.g., Estonia). These exceptions are exactly the kind of flags that will become interesting outliers in clustering.

With both the palette and complexity of each flag now quantified, we have 11 features (8 + 3) describing the color dimension of flag design. In the next section, we shift from color to geometry: how busy is the design, and what structural patterns define it?

5 Visual Complexity

Families 1 and 2 described the color of each flag. Family 3 asks a different question: how visually complex is the design itself? A tricolor with three solid blocks of color is among the simplest possible flag designs. A flag featuring a detailed coat of arms, animals, weapons, text, and ornamental borders is visually complex. NAVA’s first principle, “Keep it simple: the flag should be so simple that a child can draw it from memory”, and fourth principle, “No lettering or seals”, both relate directly to this dimension.

We measure complexity from three complementary angles, each capturing something the others miss:

  1. Visual entropy (Shannon entropy of the grayscale histogram). This is an information-theoretic measure. A perfectly uniform image has zero entropy; a perfectly random image has maximum entropy. Flags with few distinct gray levels (solid stripes) score low; flags with many gray levels (gradients, shadows, fine artwork) score high.

  2. Edge density (fraction of edge pixels detected by the Canny algorithm). This is a geometric measure. Every boundary between colors, every line in an emblem, every contour of a coat of arms contributes an edge pixel. A simple tricolor has edges only at the stripe boundaries; a flag with a detailed eagle emblem is dense with edges.

  3. Spatial entropy (entropy of the color distribution across a 4x4 grid). This captures where the complexity lives. Two flags can have identical visual entropy, but one distributes its complexity evenly (like the USA’s stars and stripes) while the other concentrates it in one spot (like Japan’s red circle on white). Spatial entropy distinguishes these two cases.
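To build intuition before the full extractor, a toy check on synthetic grayscale arrays shows the two ends of the entropy scale, using the same 256-bin histogram definition as measure 1 above (the arrays are idealized, not real flags):

```python
import numpy as np

def gray_entropy_bits(gray: np.ndarray) -> float:
    """Shannon entropy (bits) of a uint8 grayscale histogram."""
    hist, _ = np.histogram(gray.ravel(), bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                      # 0 * log(0) is defined as 0
    return float(-(p * np.log2(p)).sum())

# Idealized tricolor: three equal solid bands -> exactly log2(3) ~ 1.585 bits.
tricolor = np.zeros((300, 450), dtype=np.uint8)
tricolor[100:200] = 128
tricolor[200:] = 220

# Uniform noise: every gray level roughly equally likely -> near the 8-bit max.
noise = np.random.default_rng(0).integers(0, 256, size=(300, 450), dtype=np.uint8)

print(f"tricolor: {gray_entropy_bits(tricolor):.3f} bits")  # ~1.585
print(f"noise:    {gray_entropy_bits(noise):.3f} bits")     # just under 8.0
```

Real flags fall between these poles: solid geometric designs sit near the tricolor end, detailed emblems push toward the noise end.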

Visual complexity extraction function
def compute_visual_complexity(img_rgb):
    """Extract 3 visual complexity metrics from an RGB flag image.
    
    Parameters
    ----------
    img_rgb : np.ndarray
        RGB image array of shape (H, W, 3).
    
    Returns
    -------
    dict with keys: visual_entropy, edge_density, spatial_entropy
    """
    # Step 1: convert to grayscale for entropy and edge detection.
    # We use OpenCV's standard weighted formula (0.299R + 0.587G + 0.114B)
    # rather than a simple average, because it better matches human
    # brightness perception.
    gray = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    
    # Step 2: visual entropy -- Shannon entropy of the grayscale histogram.
    # We compute a 256-bin histogram (one bin per possible gray value),
    # normalize it to a probability distribution, and compute entropy.
    # The result is in bits. Maximum possible is log2(256) = 8.0 bits for
    # a perfectly uniform histogram (every gray value equally likely).
    hist, _ = np.histogram(gray.ravel(), bins=256, range=(0, 256))
    hist_prob = hist / hist.sum()
    # We compute the entropy directly from the probabilities (skimage's
    # shannon_entropy expects an image, not a probability vector).
    nonzero = hist_prob[hist_prob > 0]       # 0 * log(0) is defined as 0
    vis_entropy = float(-(nonzero * np.log2(nonzero)).sum())
    
    # Step 3: edge density -- Canny edge fraction.
    # Canny is a multi-stage edge detector: it smooths the image with a
    # Gaussian, computes gradients, applies non-maximum suppression, and
    # uses hysteresis thresholding. The sigma parameter controls the
    # smoothing scale. We use sigma=1.0, a standard choice that balances
    # noise rejection with detail preservation.
    edges = canny(gray, sigma=1.0)
    edge_dens = float(edges.sum() / edges.size)
    
    # Step 4: spatial entropy -- how uniformly is color complexity distributed?
    # We divide the flag into a 4x4 grid of cells (16 cells total).
    # For each cell, we compute the mean RGB color, then measure how
    # diverse these 16 mean colors are using Shannon entropy on the
    # distribution of unique color clusters.
    h, w = img_rgb.shape[:2]
    rows, cols = 4, 4
    cell_h, cell_w = h // rows, w // cols
    
    # Collect the mean color of each grid cell
    cell_colors = []
    for r in range(rows):
        for c in range(cols):
            cell = img_rgb[r*cell_h:(r+1)*cell_h, c*cell_w:(c+1)*cell_w]
            cell_colors.append(cell.mean(axis=(0, 1)))
    
    cell_colors = np.array(cell_colors)  # shape (16, 3)
    
    # Quantize cell colors into coarse bins by integer-dividing each
    # channel by 32 (8 levels per channel), then count how many cells
    # share the same quantized color.
    quantized = (cell_colors / 32).astype(int)
    # Convert to hashable tuples for counting
    color_tuples = [tuple(q) for q in quantized]
    from collections import Counter
    counts = Counter(color_tuples)
    probs = np.array(list(counts.values())) / len(color_tuples)
    # Counts are all >= 1, so every prob is > 0 and the log is safe. As in
    # Step 2, we compute the entropy directly rather than passing the
    # probability vector to skimage's shannon_entropy.
    spat_entropy = float(-(probs * np.log2(probs)).sum())
    
    return {
        "visual_entropy": round(vis_entropy, 4),
        "edge_density":   round(edge_dens, 4),
        "spatial_entropy": round(spat_entropy, 4),
    }

5.1 Extraction

We run the visual complexity extraction across all flags. This family does not depend on any previous metrics, so we iterate over df_palette only to reuse its country codes and names.

Run visual complexity extraction on all flags
records_visual = []
for _, row in df_palette.iterrows():
    svg_path = flag_dir / f"{row['code']}.svg"
    if not svg_path.exists():
        continue
    img = rasterize_flag(svg_path)
    metrics = {"code": row["code"], "name": row["name"]}
    metrics.update(compute_visual_complexity(img))
    records_visual.append(metrics)

df_visual = pd.DataFrame(records_visual)
print(f"Visual complexity extracted: {df_visual.shape[0]} flags x {df_visual.shape[1]} columns")
itshow(df_visual, lengthMenu=[5, 10, 25, 50], pageLength=5)
Visual complexity extracted: 250 flags x 5 columns

5.2 Visual Entropy

How much information does a flag’s grayscale image contain? A tricolor made of three solid blocks has very few distinct gray levels and low entropy. A flag with gradients, shading, emblem detail, and anti-aliased curves has many gray levels and high entropy.

Distribution of visual entropy across flags
fig, ax = plt.subplots(figsize=(9, 5))

ax.hist(df_visual["visual_entropy"], bins=30, color="#4C72B0",
        edgecolor="white", alpha=0.85)

ax.axvline(df_visual["visual_entropy"].mean(), color="#333",
           linestyle="--", linewidth=1.2,
           label=f"Mean: {df_visual['visual_entropy'].mean():.2f} bits")
ax.axvline(df_visual["visual_entropy"].median(), color="#666",
           linestyle=":", linewidth=1.2,
           label=f"Median: {df_visual['visual_entropy'].median():.2f} bits")

ax.set_xlabel("Visual Entropy (bits)")
ax.set_ylabel("Number of Flags")
ax.set_title("Visual Entropy of National Flags")
ax.legend(loc="upper left")
plt.tight_layout()
plt.show()

Distribution of visual entropy (Shannon entropy of the grayscale histogram, in bits). Low entropy means the flag has few distinct brightness levels (simple geometric designs). High entropy means the flag contains many brightness levels (detailed artwork, gradients, fine textures).
Flags with the highest and lowest visual entropy
fig, axes = plt.subplots(2, 8, figsize=(18, 4.5))

for row_idx, (subset, title) in enumerate([
    (df_visual.nlargest(8, "visual_entropy"), "Most Complex"),
    (df_visual.nsmallest(8, "visual_entropy"), "Simplest"),
]):
    for col_idx, (_, flag_row) in enumerate(subset.iterrows()):
        ax = axes[row_idx, col_idx]
        img = rasterize_flag(flag_dir / f"{flag_row['code']}.svg", width=320)
        ax.set_facecolor("#f0f0f0")
        ax.imshow(img, aspect="equal")
        ax.set_title(f"{flag_row['name']}\n{flag_row['visual_entropy']:.2f} bits", fontsize=7)
        ax.axis("off")
    axes[row_idx, 0].set_ylabel(title, fontsize=9, rotation=0, labelpad=65, va="center")

plt.suptitle("Visual Entropy Extremes", fontsize=12, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

Top row: the 8 flags with the highest visual entropy (most information-rich grayscale profiles). These flags contain detailed coats of arms, complex heraldry, or multi-element designs. Bottom row: the 8 flags with the lowest entropy (simplest grayscale profiles). These are clean geometric designs with very few distinct brightness levels.

5.3 Edge Density

Edge density tells us how many boundaries and contours exist in the flag’s design. The Canny edge detector finds pixels where brightness changes sharply: stripe boundaries, emblem outlines, text contours, and ornamental detail.
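
To make the metric concrete, here is a deliberately simplified, numpy-only stand-in for the Canny-based measure. The helper name and the gradient threshold are assumptions for illustration: Canny adds Gaussian smoothing, non-maximum suppression, and hysteresis, but the reported quantity (the fraction of pixels sitting on a sharp brightness transition) is the same idea.

```python
import numpy as np

def edge_density_simple(gray, thresh=0.1):
    """Fraction of pixels whose gradient magnitude exceeds a threshold.

    Simplified stand-in for the Canny-based edge density in the text;
    the threshold value is an assumption for illustration.
    """
    gy, gx = np.gradient(gray.astype(np.float64))  # finite-difference gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude per pixel
    return float((mag > thresh).mean())            # share of "edge" pixels

# A half-black, half-white field: a single vertical boundary
img = np.zeros((100, 200))
img[:, 100:] = 1.0
print(edge_density_simple(img))  # 0.01: only the two boundary columns count
```

Even this crude version reproduces the qualitative shape of the distribution: clean geometric fields score near zero, and only designs dense with contours climb into the right tail.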

Distribution of edge density across flags
fig, ax = plt.subplots(figsize=(9, 5))

ax.hist(df_visual["edge_density"], bins=30, color="#55A868",
        edgecolor="white", alpha=0.85)

ax.axvline(df_visual["edge_density"].mean(), color="#333",
           linestyle="--", linewidth=1.2,
           label=f"Mean: {df_visual['edge_density'].mean():.4f}")
ax.axvline(df_visual["edge_density"].median(), color="#666",
           linestyle=":", linewidth=1.2,
           label=f"Median: {df_visual['edge_density'].median():.4f}")

ax.set_xlabel("Edge Density (fraction of edge pixels)")
ax.set_ylabel("Number of Flags")
ax.set_title("Edge Density of National Flags")
ax.legend(loc="upper right")
plt.tight_layout()
plt.show()

Distribution of edge density (fraction of pixels detected as edges). Flags cluster in the low range: most designs are geometrically clean. A long right tail captures flags with detailed emblems, coats of arms, and ornamental borders.
Flags with the highest and lowest edge density
fig, axes = plt.subplots(2, 8, figsize=(18, 4.5))

for row_idx, (subset, title) in enumerate([
    (df_visual.nlargest(8, "edge_density"), "Most Edges"),
    (df_visual.nsmallest(8, "edge_density"), "Fewest Edges"),
]):
    for col_idx, (_, flag_row) in enumerate(subset.iterrows()):
        ax = axes[row_idx, col_idx]
        img = rasterize_flag(flag_dir / f"{flag_row['code']}.svg", width=320)
        ax.set_facecolor("#f0f0f0")
        ax.imshow(img, aspect="equal")
        ax.set_title(f"{flag_row['name']}\n{flag_row['edge_density']:.4f}", fontsize=7)
        ax.axis("off")
    axes[row_idx, 0].set_ylabel(title, fontsize=9, rotation=0, labelpad=65, va="center")

plt.suptitle("Edge Density Extremes", fontsize=12, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

Top row: the 8 flags with the highest edge density. These are the most geometrically detailed designs in the world’s flag corpus: coats of arms, text, animals, intricate heraldic devices. Bottom row: the 8 flags with the lowest edge density. These are the cleanest, most minimal designs – often bicolors or single-field flags with very few boundaries.

5.4 Spatial Entropy

Where does the complexity live inside the flag? Spatial entropy answers this by dividing each flag into a 4x4 grid of cells, computing the mean color of each cell, and measuring how diverse those 16 cell colors are. A flag where all 16 cells have the same color (solid field) scores near zero. A flag where every cell is a different color (like a busy patchwork) scores high.

This metric distinguishes two flags that might have identical visual entropy but very different spatial structures. The USA distributes stars and stripes across the entire flag (high spatial entropy). Laos concentrates its only emblem, a single white disc, at the center (low spatial entropy). Both might have similar grayscale complexity, but their spatial distribution of detail is very different.
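
A minimal sketch of the grid-based measure follows. The function name and the 16-level quantization of cell means are assumptions for illustration; the project's compute_visual_complexity() remains the authoritative version.

```python
import numpy as np
from collections import Counter

def spatial_entropy_sketch(img_rgb, grid=4, quant=16):
    """Shannon entropy (bits) of quantized mean colors over a grid x grid partition.

    Sketch of the approach described in the text; the quantization step
    (bucketing cell means into `quant`-wide levels) is an assumption.
    """
    h, w = img_rgb.shape[:2]
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = img_rgb[i * h // grid:(i + 1) * h // grid,
                           j * w // grid:(j + 1) * w // grid]
            mean = cell.reshape(-1, cell.shape[-1]).mean(axis=0)
            cells.append(tuple((mean // quant).astype(int)))
    probs = np.array(list(Counter(cells).values())) / len(cells)
    return float((probs * np.log2(1.0 / probs)).sum())

# Solid field: all 16 cells share one mean color -> 0 bits
solid = np.full((80, 120, 3), 200, dtype=np.uint8)

# Patchwork: each of the 16 cells gets its own gray level -> log2(16) = 4 bits
patch = (np.arange(16).reshape(4, 4) * 16).astype(np.uint8)
patch = np.repeat(np.repeat(patch, 20, axis=0), 30, axis=1)
patch = np.stack([patch] * 3, axis=-1)

print(spatial_entropy_sketch(solid), spatial_entropy_sketch(patch))  # 0.0 4.0
```

The two synthetic extremes bracket the real flags: a solid field scores 0 bits, a maximally diverse 4x4 patchwork scores the theoretical maximum of 4 bits.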

Distribution of spatial entropy across flags
fig, ax = plt.subplots(figsize=(9, 5))

ax.hist(df_visual["spatial_entropy"], bins=30, color="#C44E52",
        edgecolor="white", alpha=0.85)

ax.axvline(df_visual["spatial_entropy"].mean(), color="#333",
           linestyle="--", linewidth=1.2,
           label=f"Mean: {df_visual['spatial_entropy'].mean():.2f} bits")
ax.axvline(df_visual["spatial_entropy"].median(), color="#666",
           linestyle=":", linewidth=1.2,
           label=f"Median: {df_visual['spatial_entropy'].median():.2f} bits")

ax.set_xlabel("Spatial Entropy (bits)")
ax.set_ylabel("Number of Flags")
ax.set_title("Spatial Entropy of National Flags")
ax.legend(loc="upper left")
plt.tight_layout()
plt.show()

Distribution of spatial entropy (entropy of the 4x4 grid color distribution). Low values indicate uniform or single-element designs where most cells share the same color. High values indicate patterned or multi-region designs where the 16 grid cells show diverse colors.
Flags with the highest and lowest spatial entropy
fig, axes = plt.subplots(2, 8, figsize=(18, 4.5))

for row_idx, (subset, title) in enumerate([
    (df_visual.nlargest(8, "spatial_entropy"), "Most Distributed"),
    (df_visual.nsmallest(8, "spatial_entropy"), "Most Uniform"),
]):
    for col_idx, (_, flag_row) in enumerate(subset.iterrows()):
        ax = axes[row_idx, col_idx]
        img = rasterize_flag(flag_dir / f"{flag_row['code']}.svg", width=320)
        ax.set_facecolor("#f0f0f0")
        ax.imshow(img, aspect="equal")
        ax.set_title(f"{flag_row['name']}\n{flag_row['spatial_entropy']:.2f} bits", fontsize=7)
        ax.axis("off")
    axes[row_idx, 0].set_ylabel(title, fontsize=9, rotation=0, labelpad=65, va="center")

plt.suptitle("Spatial Entropy Extremes", fontsize=12, fontweight="bold", y=1.01)
plt.tight_layout()
plt.show()

Top row: the 8 flags with the highest spatial entropy. These designs distribute complexity across the entire flag surface. Bottom row: the 8 flags with the lowest spatial entropy. These are the most spatially uniform designs, where nearly all grid cells share the same dominant color.

5.5 Complexity Landscape

Let’s combine all three metrics into a single view. The scatter plot below maps each flag in the space of edge density vs visual entropy, with spatial entropy encoded as marker color. This reveals how the three complementary dimensions of complexity relate to each other.

Interactive scatter: edge density vs visual entropy, colored by spatial entropy
fig = px.scatter(
    df_visual, x="visual_entropy", y="edge_density",
    color="spatial_entropy",
    color_continuous_scale="magma",
    hover_name="name",
    hover_data={"visual_entropy": ":.2f", "edge_density": ":.4f",
                "spatial_entropy": ":.2f"},
    labels={"visual_entropy": "Visual Entropy (bits)",
            "edge_density": "Edge Density",
            "spatial_entropy": "Spatial Entropy (bits)"},
    title="The Complexity Landscape of National Flags",
    opacity=0.75, width=800, height=600,
)

fig.show()

Each dot is a flag. X-axis: visual entropy (grayscale information content). Y-axis: edge density (geometric detail). Color: spatial entropy (how distributed the complexity is). Flags in the upper-right are both information-rich and edge-dense – the most visually complex designs in the world.

5.6 Palette Complexity Meets Visual Complexity

How does Family 2’s palette_complexity (number of distinct color clusters) relate to Family 3’s visual complexity? Flags with detailed coats of arms should score high on both: they have many colors and many edges. But clean geometric designs can have high color variety without high edge density (think of the South African flag: 6 colors, very few edges). This cross-family scatter tests whether color complexity and geometric complexity are redundant or complementary.

Interactive cross-family scatter: palette complexity vs edge density
df_cross_visual = df_complexity[["code", "name", "palette_complexity"]].merge(
    df_visual[["code", "visual_entropy", "edge_density", "spatial_entropy"]], on="code"
)

corr = df_cross_visual["palette_complexity"].corr(df_cross_visual["edge_density"])

fig = px.scatter(
    df_cross_visual, x="palette_complexity", y="edge_density",
    color="visual_entropy",
    color_continuous_scale="viridis",
    hover_name="name",
    hover_data={"palette_complexity": True, "edge_density": ":.4f",
                "visual_entropy": ":.2f"},
    labels={"palette_complexity": "Palette Complexity (Family 2)",
            "edge_density": "Edge Density (Family 3)",
            "visual_entropy": "Visual Entropy (bits)"},
    title=f"Color Complexity vs Geometric Complexity (Pearson r = {corr:.3f})",
    opacity=0.75, width=800, height=600,
)

fig.update_layout(xaxis=dict(dtick=1))
fig.show()

Each dot is a flag. X-axis: palette complexity (Family 2). Y-axis: edge density (Family 3). Color: visual entropy (Family 3). The positive correlation confirms that flags with more colors also tend to have more edges, but the scatter is wide – many flags with 3-4 colors span the full range of edge density, showing that the two metrics capture genuinely different design dimensions.

5.7 Discussion

Several findings emerge from Family 3.

Most flags are simple. Visual entropy clusters in a narrow band, and edge density is heavily right-skewed: the median flag has very few edge pixels. This confirms NAVA’s observation that simplicity is the dominant principle in flag design worldwide. The few flags with high edge density stand out as clear outliers: almost all of them carry detailed coats of arms, heraldic devices, or text inscriptions that violate NAVA’s “keep it simple” and “no lettering or seals” principles.

Edge density is the sharpest discriminator. While visual entropy captures grayscale diversity (which can be elevated by gradients and subtle shading), edge density directly measures the number of sharp boundaries in the design. The most edge-dense flags are immediately recognizable as the world’s most visually intricate designs, while the least edge-dense are the cleanest geometric compositions.

Spatial entropy reveals structural families. Flags with high spatial entropy distribute their complexity across the entire surface (striped patterns, multi-panel designs, star fields). Flags with low spatial entropy concentrate all their detail in one region or repeat the same color everywhere. This metric will be particularly useful for distinguishing between “busy but structured” designs (like the USA) and “busy but concentrated” designs (like flags with a central emblem on a plain field).

Color complexity and geometric complexity are correlated but not redundant. The positive correlation between palette complexity and edge density makes intuitive sense: more colors means more boundaries. But the wide scatter shows that many flags break this pattern. Clean geometric flags like South Africa or Mauritius pack many colors into few edges, while detailed monochromatic emblems create many edges from few colors. The two metric families capture genuinely different design dimensions, which validates our decision to measure both.

With 14 features now extracted (8 + 3 + 3), we have covered both the color and complexity dimensions of flag design. In the next section, we turn to Geometric Structure: the spatial organization of lines and symmetries that define each flag’s layout.

6 Geometric Structure

Flags encode their identity not just in color, but in geometry. A horizontal triband, a vertical tricolor, a Nordic cross, a diagonal slash: each structural pattern carries historical weight and group membership. Horizontal tricolors descend from the Dutch tradition, vertical tricolors from the French revolutionary tricolore. Nordic crosses mark Scandinavian identity. Diagonal stripes are rarer and more dynamic, often signaling a deliberate break from colonial templates. And bilateral symmetry (whether a flag reads the same from left to right) is a fundamental design property that most flags share but a few deliberately violate.

We detect dominant line angles using the Hough Transform, a classical computer vision algorithm that converts edge pixels into a voting space of possible lines. Each detected edge pixel “votes” for all lines that could pass through it, and the lines with the most votes emerge as the dominant linear structures in the image. By classifying these peak lines by their angle, we quantify whether a flag’s geometry is primarily horizontal, vertical, or diagonal. Separately, we measure bilateral symmetry as the pixel-wise Pearson correlation between the flag and its horizontal mirror image.

One important design choice: rather than hardcoding 45 degrees as the center of the “diagonal” zone, we use a three-way angular partition with deliberately tight horizontal and vertical bins. A line counts as horizontal only if it falls within 10 degrees of horizontal, and vertical only if it falls within 10 degrees of vertical; everything in between is classified as diagonal. This strict definition avoids a common pitfall: in the Hough angle convention used below (0 degrees is a vertical line, +/-90 degrees horizontal), a corner-to-corner line on a flag with a 2:1 aspect ratio sits at arctan(2) = 63.4 degrees, and even the triangle edges of American Samoa sit at roughly 76 degrees. With a looser threshold those would be misclassified as horizontal. Our tight partition correctly labels any line that is not truly flat or truly upright as diagonal.
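
The partition itself fits in a few lines. classify_angle below is a self-contained sketch of the binning rule just described, using the same skimage angle convention as the full function that follows:

```python
import math

def classify_angle(angle_deg):
    """Three-way bin for a Hough line angle (skimage convention:
    0 deg = vertical line, +/-90 deg = horizontal line)."""
    a = abs(angle_deg)
    if a > 80:           # within 10 degrees of +/-90
        return "horizontal"
    if a < 10:           # within 10 degrees of 0
        return "vertical"
    return "diagonal"    # everything in between

# The worked examples from the text
print(classify_angle(math.degrees(math.atan(2))))  # diagonal (2:1 corner-to-corner, 63.4 deg)
print(classify_angle(76))                          # diagonal (American Samoa's triangle edges)
print(classify_angle(-88))                         # horizontal
```

With a looser partition, say a 30-degree tolerance around horizontal, both worked examples above would flip to "horizontal", which is exactly the misclassification the tight bins prevent.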

Function: compute_geometric_structure()
def compute_geometric_structure(img_rgb):
    """
    Compute geometric structure metrics for a flag image.
    
    Parameters
    ----------
    img_rgb : np.ndarray
        Flag image as an (H, W, 3) RGB array.
    
    Returns
    -------
    tuple of four floats:
        horizontal_dominance : fraction of strong Hough lines that are near-horizontal
        vertical_dominance   : fraction of strong Hough lines that are near-vertical
        diagonal_dominance   : fraction of strong Hough lines in the diagonal zone
        symmetry_score       : Pearson correlation between the flag and its mirror
    
    Notes
    -----
    The Hough Transform detects lines in the edge map. We use a threshold
    of 0.3x the maximum accumulator value to select "strong" lines, then
    classify each line's angle into one of three mutually exclusive bins:
    
      - Horizontal: |angle| > 80 deg  (within 10 deg of +/-90 in skimage convention)
      - Vertical:   |angle| < 10 deg  (within 10 deg of 0 in skimage convention)
      - Diagonal:   10 <= |angle| <= 80  (everything in between)
    
    The tight 10-degree tolerance ensures that only truly flat or truly
    upright lines count as horizontal/vertical. Lines at moderate angles
    (like the triangle edges of American Samoa at ~76 deg, or a 2:1
    corner-to-corner diagonal at ~63 deg) are correctly classified as diagonal.
    
    Symmetry uses Pearson r rather than simple pixel difference because
    it is invariant to global brightness shifts and captures both:
      - Positive values: mirror-symmetric designs (most flags)
      - Zero: no particular left-right relationship
      - Negative values: anti-symmetric designs (e.g., Malta: white|red -> red|white)
    """
    # --- Step 1: Convert to grayscale and detect edges ---
    gray = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2GRAY)
    edges = canny(gray, sigma=1.0)
    
    # --- Step 2: Hough Transform ---
    # In skimage convention: angle=0 means vertical line, angle=+/-90 means horizontal.
    # We test 360 equally spaced angles spanning the full -90 to +90 range.
    tested_angles = np.linspace(-np.pi / 2, np.pi / 2, 360, endpoint=False)
    hspace, angles, dists = hough_line(edges, theta=tested_angles)
    
    # Select strong lines: peaks above 30% of the maximum accumulator value.
    # min_distance and min_angle prevent detecting the same line twice.
    threshold = 0.3 * hspace.max() if hspace.max() > 0 else 1
    _, peak_angles, _ = hough_line_peaks(
        hspace, angles, dists,
        min_distance=9,    # minimum pixel distance between peaks
        min_angle=10,      # minimum angular separation between peaks
        threshold=threshold
    )
    
    # --- Step 3: Classify line angles ---
    if len(peak_angles) == 0:
        # No lines detected (extremely simple flag, e.g., solid color)
        horiz, vert, diag = 0.0, 0.0, 0.0
    else:
        angles_deg = np.degrees(peak_angles)
        n_total = len(angles_deg)
        
        # Three mutually exclusive bins covering the full -90 to +90 range.
        # Tight 10-degree tolerance: only truly flat/upright lines are H/V.
        n_horiz = np.sum(np.abs(np.abs(angles_deg) - 90) < 10)
        n_vert  = np.sum(np.abs(angles_deg) < 10)
        n_diag  = n_total - n_horiz - n_vert  # everything in between
        
        horiz = float(n_horiz / n_total)
        vert  = float(n_vert / n_total)
        diag  = float(n_diag / n_total)
    
    # --- Step 4: Bilateral symmetry via Pearson correlation ---
    gray_f   = gray.astype(np.float64)
    mirrored = np.fliplr(gray_f)
    
    diff_orig = gray_f   - gray_f.mean()
    diff_mirr = mirrored  - mirrored.mean()
    
    numerator   = (diff_orig * diff_mirr).sum()
    denominator = np.sqrt((diff_orig ** 2).sum() * (diff_mirr ** 2).sum())
    
    if denominator == 0:
        sym = 1.0  # solid color flag is trivially symmetric
    else:
        sym = float(numerator / denominator)
    
    return (
        round(horiz, 4),
        round(vert,  4),
        round(diag,  4),
        round(sym,   4),
    )

6.1 Extraction

Extract geometric structure for all 250 flags
# ---- Run compute_geometric_structure() on every flag ----
rows_geom = []
for _, row in df_palette.iterrows():
    svg_path = flag_dir / f"{row['code']}.svg"
    if not svg_path.exists():
        continue
    img = rasterize_flag(svg_path)
    
    h_dom, v_dom, d_dom, sym = compute_geometric_structure(img)
    
    rows_geom.append({
        "code": row["code"],
        "name": row["name"],
        "horizontal_dominance": h_dom,
        "vertical_dominance":   v_dom,
        "diagonal_dominance":   d_dom,
        "symmetry_score":       sym,
    })

df_geometry = pd.DataFrame(rows_geom)
itshow(df_geometry, lengthMenu=[5, 10, 25, 50], pageLength=5)

6.2 Horizontal Dominance

Horizontal lines are the backbone of the world’s most common flag family: the horizontal triband (three horizontal stripes). From the Dutch prinsenvlag to the pan-African and pan-Arab traditions, horizontal stripes carry enormous historical weight. We expect the distribution to be heavily weighted toward high values, with a secondary cluster at zero for flags whose geometry runs in other directions.

Histogram of horizontal dominance across all 250 flags
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(df_geometry["horizontal_dominance"], bins=30, color="#2196F3", edgecolor="white", alpha=0.85)
ax.set_xlabel("Horizontal Dominance")
ax.set_ylabel("Number of Flags")
ax.set_title("Horizontal Dominance Across 250 National Flags")
ax.axvline(df_geometry["horizontal_dominance"].median(), color="red", linestyle="--", linewidth=1.2, label=f'Median = {df_geometry["horizontal_dominance"].median():.2f}')
ax.legend()
plt.tight_layout()
plt.show()

Horizontal dominance measures the fraction of strong Hough lines that are near-horizontal. The spike at 1.0 represents flags whose geometry is purely horizontal – the classic triband family. The second-largest cluster at 0.0 represents flags with no horizontal lines at all, typically vertical tricolors or diagonal designs.
Flags with highest and lowest horizontal dominance
top_h = df_geometry.nlargest(5, "horizontal_dominance")
bot_h = df_geometry.nsmallest(5, "horizontal_dominance")

fig, axes = plt.subplots(2, 5, figsize=(14, 5))

for i, (_, r) in enumerate(top_h.iterrows()):
    img = rasterize_flag(flag_dir / f"{r['code']}.svg", width=320)
    axes[0, i].imshow(img)
    axes[0, i].set_title(f"{r['name']}\n{r['horizontal_dominance']:.2f}", fontsize=8)
    axes[0, i].axis("off")

for i, (_, r) in enumerate(bot_h.iterrows()):
    img = rasterize_flag(flag_dir / f"{r['code']}.svg", width=320)
    axes[1, i].imshow(img)
    axes[1, i].set_title(f"{r['name']}\n{r['horizontal_dominance']:.2f}", fontsize=8)
    axes[1, i].axis("off")

axes[0, 0].set_ylabel("Most Horizontal", fontsize=10, fontweight="bold")
axes[1, 0].set_ylabel("Least Horizontal", fontsize=10, fontweight="bold")
fig.suptitle("Horizontal Dominance Extremes", fontsize=13, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

Top row: the five flags with the strongest horizontal line structure – all classic horizontal tribands or multi-stripe designs. Bottom row: the five flags with the least horizontal geometry, typically diagonal or complex emblem designs.

6.3 Vertical Dominance

Vertical tricolors form the second-largest structural family in the world, descending from the French revolutionary tricolore. We expect a bimodal distribution: many flags at zero (no vertical lines) and a cluster near 1.0 for pure vertical designs, with a smaller middle group for flags that combine vertical and horizontal elements (like crosses).

Histogram of vertical dominance across all 250 flags
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(df_geometry["vertical_dominance"], bins=30, color="#FF9800", edgecolor="white", alpha=0.85)
ax.set_xlabel("Vertical Dominance")
ax.set_ylabel("Number of Flags")
ax.set_title("Vertical Dominance Across 250 National Flags")
ax.axvline(df_geometry["vertical_dominance"].median(), color="red", linestyle="--", linewidth=1.2, label=f'Median = {df_geometry["vertical_dominance"].median():.2f}')
ax.legend()
plt.tight_layout()
plt.show()

Vertical dominance measures the fraction of strong Hough lines that are near-vertical. The distribution is strongly right-skewed: most flags have few or no vertical lines. The cluster at 1.0 captures the vertical tricolor family (France, Italy, Ireland, Belgium, etc.).
Flags with highest and lowest vertical dominance
top_v = df_geometry.nlargest(5, "vertical_dominance")
bot_v = df_geometry.nsmallest(5, "vertical_dominance")

fig, axes = plt.subplots(2, 5, figsize=(14, 5))

for i, (_, r) in enumerate(top_v.iterrows()):
    img = rasterize_flag(flag_dir / f"{r['code']}.svg", width=320)
    axes[0, i].imshow(img)
    axes[0, i].set_title(f"{r['name']}\n{r['vertical_dominance']:.2f}", fontsize=8)
    axes[0, i].axis("off")

for i, (_, r) in enumerate(bot_v.iterrows()):
    img = rasterize_flag(flag_dir / f"{r['code']}.svg", width=320)
    axes[1, i].imshow(img)
    axes[1, i].set_title(f"{r['name']}\n{r['vertical_dominance']:.2f}", fontsize=8)
    axes[1, i].axis("off")

axes[0, 0].set_ylabel("Most Vertical", fontsize=10, fontweight="bold")
axes[1, 0].set_ylabel("Least Vertical", fontsize=10, fontweight="bold")
fig.suptitle("Vertical Dominance Extremes", fontsize=13, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

Top row: the five flags with the strongest vertical line structure. Bottom row: the five flags with the least vertical geometry.

6.4 Diagonal Dominance

Diagonal lines are the rarest of the three structural directions. Most flag design traditions favor horizontal or vertical compositions. When diagonals do appear, they are often dramatic and intentional: the bold slash of Tanzania, the saltire of Jamaica, the chevron of South Africa. Diagonal designs are particularly relevant to our Revolutionary Diagonal hypothesis: flags born from anti-colonial struggle may favor diagonals as a deliberate break from the orderly horizontal and vertical grid of European tradition.

A caveat: flags with curved elements (circles, crescents, emblems) can produce Hough lines at various angles, inflating diagonal scores even when the flag has no genuine diagonal stripes. We note this in the discussion and rely on high diagonal scores as an indicator rather than a perfect classifier.

Histogram of diagonal dominance across all 250 flags
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(df_geometry["diagonal_dominance"], bins=30, color="#4CAF50", edgecolor="white", alpha=0.85)
ax.set_xlabel("Diagonal Dominance")
ax.set_ylabel("Number of Flags")
ax.set_title("Diagonal Dominance Across 250 National Flags")
ax.axvline(df_geometry["diagonal_dominance"].median(), color="red", linestyle="--", linewidth=1.2, label=f'Median = {df_geometry["diagonal_dominance"].median():.2f}')
ax.legend()
plt.tight_layout()
plt.show()

Diagonal dominance measures the fraction of strong Hough lines in the diagonal zone (between 10 and 80 degrees, i.e., neither near-horizontal nor near-vertical). The spike near 0.0 reflects the dominance of horizontal and vertical design traditions. The tail toward 1.0 captures deliberately diagonal designs like Tanzania, Namibia, and Jamaica.
Flags with highest and lowest diagonal dominance
top_d = df_geometry.nlargest(5, "diagonal_dominance")
bot_d = df_geometry.nsmallest(5, "diagonal_dominance")

fig, axes = plt.subplots(2, 5, figsize=(14, 5))

for i, (_, r) in enumerate(top_d.iterrows()):
    img = rasterize_flag(flag_dir / f"{r['code']}.svg", width=320)
    axes[0, i].imshow(img)
    axes[0, i].set_title(f"{r['name']}\n{r['diagonal_dominance']:.2f}", fontsize=8)
    axes[0, i].axis("off")

for i, (_, r) in enumerate(bot_d.iterrows()):
    img = rasterize_flag(flag_dir / f"{r['code']}.svg", width=320)
    axes[1, i].imshow(img)
    axes[1, i].set_title(f"{r['name']}\n{r['diagonal_dominance']:.2f}", fontsize=8)
    axes[1, i].axis("off")

axes[0, 0].set_ylabel("Most Diagonal", fontsize=10, fontweight="bold")
axes[1, 0].set_ylabel("Least Diagonal", fontsize=10, fontweight="bold")
fig.suptitle("Diagonal Dominance Extremes", fontsize=13, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

Top row: the five flags with the strongest diagonal geometry – bold slashes and saltires that break from the horizontal/vertical grid. Bottom row: the five flags with pure horizontal or vertical geometry and zero diagonal lines.

6.5 Symmetry Score

Most flags are designed to be readable from both sides: when a flag flies on a pole, a viewer on either side should see essentially the same design. This means most flags have high bilateral symmetry. But some flags deliberately break this rule. Nepal’s unique shape is inherently asymmetric. Portugal and Sri Lanka place their emblems off-center. And a special category of flags (Malta, Algeria, Panama) have anti-symmetric designs: their left and right halves are color-swapped, producing strongly negative Pearson correlations when mirrored.
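
The anti-symmetric case is easy to verify on toy images. The helper below is a standalone sketch of the mirror-correlation idea from this section (the project's version lives in Step 4 of compute_geometric_structure()); a synthetic Malta-style bicolor comes out at exactly r = -1, while a centered band scores +1.

```python
import numpy as np

def mirror_symmetry(gray):
    """Pearson r between a grayscale image and its left-right mirror.

    Standalone sketch of the symmetry score described in the text.
    """
    a = gray.astype(np.float64)
    b = np.fliplr(a)                 # horizontal mirror image
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    if denom == 0:
        return 1.0                   # a solid field is trivially symmetric
    return float((a * b).sum() / denom)

# Malta-style bicolor: bright left half, dark right half -> anti-symmetric
bicolor = np.zeros((50, 100))
bicolor[:, :50] = 255
print(mirror_symmetry(bicolor))  # -1.0

# A centered band maps onto itself when mirrored -> symmetric
band = np.zeros((50, 100))
band[:, 25:75] = 255
print(mirror_symmetry(band))     # 1.0
```

The `denom == 0` branch mirrors the solid-color convention in the main function: a featureless field is treated as perfectly symmetric rather than undefined.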

Histogram of symmetry scores across all 250 flags
fig, ax = plt.subplots(figsize=(9, 5))
ax.hist(df_geometry["symmetry_score"], bins=40, color="#9C27B0", edgecolor="white", alpha=0.85)
ax.set_xlabel("Symmetry Score (Pearson r)")
ax.set_ylabel("Number of Flags")
ax.set_title("Bilateral Symmetry Across 250 National Flags")
ax.axvline(0, color="gray", linestyle=":", linewidth=1.0, alpha=0.6)
ax.axvline(df_geometry["symmetry_score"].median(), color="red", linestyle="--", linewidth=1.2, label=f'Median = {df_geometry["symmetry_score"].median():.2f}')
ax.legend()
plt.tight_layout()
plt.show()

Symmetry score (Pearson correlation with the horizontal mirror image) ranges from -1 (perfectly anti-symmetric) through 0 (no relationship) to +1 (perfectly symmetric). The right-skewed distribution confirms that most flags are designed to be symmetric, but a substantial minority clusters below zero – these are flags whose left and right halves are deliberately different.
Flags with highest and lowest symmetry scores
top_s = df_geometry.nlargest(5, "symmetry_score")
bot_s = df_geometry.nsmallest(5, "symmetry_score")

fig, axes = plt.subplots(2, 5, figsize=(14, 5))

for i, (_, r) in enumerate(top_s.iterrows()):
    img = rasterize_flag(flag_dir / f"{r['code']}.svg", width=320)
    axes[0, i].imshow(img)
    axes[0, i].set_title(f"{r['name']}\n{r['symmetry_score']:.2f}", fontsize=8)
    axes[0, i].axis("off")

for i, (_, r) in enumerate(bot_s.iterrows()):
    img = rasterize_flag(flag_dir / f"{r['code']}.svg", width=320)
    axes[1, i].imshow(img)
    axes[1, i].set_title(f"{r['name']}\n{r['symmetry_score']:.2f}", fontsize=8)
    axes[1, i].axis("off")

axes[0, 0].set_ylabel("Most Symmetric", fontsize=10, fontweight="bold")
axes[1, 0].set_ylabel("Most Anti-Symmetric", fontsize=10, fontweight="bold")
fig.suptitle("Symmetry Score Extremes", fontsize=13, fontweight="bold", y=1.02)
plt.tight_layout()
plt.show()

Top row: the five most symmetric flags – centered horizontal bands that are identical when mirrored. Bottom row: the five most anti-symmetric flags – bicolor or quartered designs whose left and right halves are color-swapped, producing strongly negative correlations.

6.6 The Structural Landscape

How do the three directional dominances relate to each other? Since they sum to 1.0 for any flag with at least one detected line (each Hough line is classified into exactly one bin), the natural visualization is a ternary-style scatter plot. Here we use a 2D projection: horizontal dominance on the x-axis, vertical dominance on the y-axis; since diagonal = 1 - horizontal - vertical, purely diagonal flags cluster near the origin.

Interactive scatter: horizontal vs vertical dominance, colored by symmetry
fig = px.scatter(
    df_geometry, x="horizontal_dominance", y="vertical_dominance",
    color="symmetry_score",
    color_continuous_scale="RdBu",
    range_color=[-1, 1],
    hover_name="name",
    hover_data={"horizontal_dominance": ":.2f", "vertical_dominance": ":.2f",
                "diagonal_dominance": ":.2f", "symmetry_score": ":.3f"},
    labels={"horizontal_dominance": "Horizontal Dominance",
            "vertical_dominance": "Vertical Dominance",
            "symmetry_score": "Symmetry Score"},
    title="The Structural Landscape of National Flags",
    opacity=0.75, width=800, height=650,
)

# Triangle boundary (h + v <= 1)
fig.add_shape(type="line", x0=0, y0=1, x1=1, y1=0,
              line=dict(color="black", width=1, dash="dash"), opacity=0.3)

fig.update_layout(xaxis_range=[-0.05, 1.05], yaxis_range=[-0.05, 1.05])
fig.show()

Each dot is a flag. X-axis: horizontal dominance. Y-axis: vertical dominance. Color: symmetry score. The three corners of the triangle represent pure design families: horizontal tribands (right), vertical tricolors (top), and diagonal designs (origin). Symmetric flags (red) cluster in the horizontal and vertical zones, while asymmetric flags (blue) are scattered throughout.

6.7 Symmetry vs Edge Density

How does a flag’s geometric complexity (from Family 3) relate to its symmetry? We might expect that visually complex flags, those with detailed emblems, coats of arms, and text, tend to be less symmetric, because such detail is often placed off-center (like Portugal’s coat of arms on the hoist side). This cross-family scatter tests that intuition.

Interactive cross-family scatter: symmetry score vs edge density
df_cross_geom = df_geometry[["code", "name", "symmetry_score", "diagonal_dominance"]].merge(
    df_visual[["code", "edge_density"]], on="code"
)

corr = df_cross_geom["edge_density"].corr(df_cross_geom["symmetry_score"])

fig = px.scatter(
    df_cross_geom, x="edge_density", y="symmetry_score",
    color="diagonal_dominance",
    color_continuous_scale="YlOrRd",
    hover_name="name",
    hover_data={"edge_density": ":.4f", "symmetry_score": ":.3f",
                "diagonal_dominance": ":.2f"},
    labels={"edge_density": "Edge Density (Family 3)",
            "symmetry_score": "Symmetry Score (Family 4)",
            "diagonal_dominance": "Diagonal Dominance"},
    title=f"Geometric Complexity vs Bilateral Symmetry (Pearson r = {corr:.3f})",
    opacity=0.75, width=800, height=600,
)

fig.add_hline(y=0, line_dash="dot", line_color="gray", opacity=0.5)
fig.show()

Each dot is a flag. X-axis: edge density (Family 3). Y-axis: symmetry score (Family 4). Color: diagonal dominance. The negative trend confirms that visually complex flags tend to be less symmetric. Flags in the upper-left are the idealized ‘good flag’: simple geometry with perfect bilateral symmetry. Flags in the lower-right are detailed, asymmetric designs.

6.8 Discussion

Family 4 reveals the structural skeleton of flag design.

Horizontal lines dominate the world. The average flag has a horizontal dominance of 0.54, more than double its vertical (0.21) or diagonal (0.24) dominance. This quantitatively confirms the global prevalence of horizontal stripe patterns, the design family that stretches from the Netherlands through the pan-African, pan-Arab, and pan-Slavic traditions. Horizontal stripes are the default grammar of modern flag design.

Vertical tricolors form a distinct but smaller family. The vertical dominance distribution is bimodal: most flags score near zero, with a clear secondary peak at 1.0 for the French-tradition vertical tricolors. These two peaks correspond precisely to two of the world’s largest flag design families, and the Hough Transform cleanly separates them.

Diagonal designs are rare and intentional. Only a handful of flags achieve high diagonal dominance through actual diagonal stripes (Tanzania, Namibia, Jamaica, DR Congo). The remaining “diagonal” scores come from curved elements (circles, crescents) whose Hough projections scatter across multiple angles. This caveat is important: high diagonal dominance is a strong signal when it reaches 0.8 or above, but moderate values (0.3-0.6) may reflect curves rather than true diagonal geometry.

Symmetry splits the world in two. About 72 flags score above 0.99 (near-perfect bilateral symmetry), while 48 flags score below zero (anti-symmetric). The anti-symmetric group is particularly interesting: these are bicolor flags (Malta, Algeria) and quartered designs (Panama) whose left and right halves are deliberately color-swapped. The Pearson correlation captures this as a negative value, giving us a richer signal than a simple symmetric/asymmetric binary would.
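A minimal sketch of how such a signed score can arise, assuming the score is the Pearson correlation between the left half of the grayscale image and the horizontally mirrored right half (consistent with the behavior described above):

```python
import numpy as np

def symmetry_score(gray):
    """Pearson correlation between the left half of a grayscale image
    and the horizontally mirrored right half: +1 for perfect bilateral
    symmetry, negative for deliberately color-swapped halves."""
    h, w = gray.shape
    half = w // 2
    left  = gray[:, :half].astype(float).ravel()
    right = np.fliplr(gray[:, w - half:]).astype(float).ravel()
    return float(np.corrcoef(left, right)[0, 1])

# A left-to-right brightness gradient is perfectly anti-symmetric:
grad = np.tile(np.linspace(0, 255, 8), (4, 1))
print(round(symmetry_score(grad), 3))  # -1.0

# A pattern that peaks at the center is perfectly symmetric:
ramp = np.concatenate([np.linspace(0, 255, 4), np.linspace(255, 0, 4)])
print(round(symmetry_score(np.tile(ramp, (4, 1))), 3))  # 1.0
```

The color-swapped bicolors behave like the gradient case: each pixel on the left is the tonal opposite of its mirror on the right, so the correlation goes negative.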

Complexity and asymmetry go hand in hand. The cross-family scatter reveals a negative correlation between edge density and symmetry score: flags with detailed coats of arms, seals, and text inscriptions tend to be less symmetric, often because these elements are placed off-center on the hoist side. This supports NAVA’s design principle that simplicity and visual clarity are connected properties.

With 18 features now extracted (8 + 3 + 3 + 4), only one metric remains: the flag’s aspect ratio, which we measure in the next section.

7 Aspect Ratio

The shape of a flag is one of its most fundamental design decisions, yet it is easy to overlook in computational analyses that resize every image to a uniform grid. A flag’s aspect ratio, width divided by height, is fixed by law or tradition and carries real meaning. The 3:2 ratio (~1.50) is the global default, used by roughly half the world. The 2:1 ratio (~2.00) marks a second large family, dominated by former British colonies. Switzerland and Vatican City are the only square flags (1.00). Nepal is the only sovereign flag taller than it is wide (~0.82), a double-pennant shape that breaks every rectangle assumption. And Qatar stretches to nearly 2.55:1, the widest of all.

Unlike the previous families, aspect ratio requires no edge detection or color analysis. We simply rasterize each SVG at a reference width and measure the resulting image dimensions. To avoid pixel-rounding artifacts, we use a larger rasterization width (800 pixels) for this single metric.

Function: compute_aspect_ratio()
def compute_aspect_ratio(svg_path, reference_width=800):
    """
    Compute the aspect ratio (width / height) of a flag from its SVG source.
    
    Parameters
    ----------
    svg_path : Path or str
        Path to the SVG file.
    reference_width : int
        Rasterization width in pixels. A larger value reduces rounding error
        in the height dimension. Default 800.
    
    Returns
    -------
    float
        Width / height. Values > 1 mean wider than tall (the vast majority).
        Values < 1 mean taller than wide (only Nepal).
        Values == 1 mean square (Switzerland, Vatican City).
    """
    # Rasterize the SVG at the reference width.
    # CairoSVG respects the SVG's intrinsic aspect ratio,
    # so the height adjusts automatically.
    png_data = cairosvg.svg2png(url=str(svg_path), output_width=reference_width)
    img = Image.open(io.BytesIO(png_data)).convert("RGB")
    w, h = img.size
    return round(w / h, 4)

7.1 Extraction

Run aspect ratio extraction on all flags
# ---- Compute aspect ratio for every flag ----
rows_ar = []
for _, row in df_palette.iterrows():
    svg_path = flag_dir / f"{row['code']}.svg"
    if not svg_path.exists():
        continue
    ar = compute_aspect_ratio(svg_path)
    rows_ar.append({
        "code":         row["code"],
        "name":         row["name"],
        "aspect_ratio": ar,
    })

df_aspect = pd.DataFrame(rows_ar)
itshow(df_aspect, lengthMenu=[5, 10, 25, 50], pageLength=5)

7.2 The Shape Distribution

The histogram below reveals that aspect ratio is not a continuous spectrum: it clusters tightly around a handful of standard values. Two peaks dominate the landscape, 3:2 and 2:1, with a scattering of rarer proportions in between and at the extremes.

Histogram of aspect ratios across 250 flags
fig, ax = plt.subplots(figsize=(10, 5))

# Use narrow bins to reveal the discrete clustering
ax.hist(df_aspect["aspect_ratio"], bins=40, color="#2c7bb6", edgecolor="white",
        linewidth=0.5, alpha=0.85)

# Mark the major standard ratios with vertical lines
standards = {
    "Nepal\n(~0.82)": 0.82,
    "1:1\n(CH, VA)": 1.00,
    "3:2\n(~1.50)": 1.50,
    "2:1\n(~2.00)": 2.00,
    "Qatar\n(~2.55)": 2.55,
}
for label, val in standards.items():
    ax.axvline(val, color="#d7191c", linestyle="--", linewidth=1, alpha=0.7)
    ax.text(val, ax.get_ylim()[1] * 0.92, label, ha="center", fontsize=7,
            color="#d7191c", fontweight="bold")

ax.set_xlabel("Aspect Ratio (width / height)")
ax.set_ylabel("Number of Flags")
ax.set_title("Flag Aspect Ratios Cluster Around a Few Standard Proportions")
plt.tight_layout()
plt.show()

The distribution of flag aspect ratios is strongly bimodal. The tallest peak sits at 1.50 (the 3:2 ratio used by about 110 flags), with a second peak at 2.00 (the 2:1 ratio used by about 77 flags). A handful of flags occupy rarer proportions: square (1.00), nearly square (Belgium at 1.15, Niger at 1.17), and extremely wide (Qatar at 2.55).

7.3 The Extremes

Flag strip: narrowest and widest aspect ratios
n_show = 5
narrowest = df_aspect.nsmallest(n_show, "aspect_ratio")
widest    = df_aspect.nlargest(n_show, "aspect_ratio")

fig, axes = plt.subplots(2, n_show, figsize=(14, 5))
fig.suptitle("Narrowest and Widest National Flags", fontsize=13, fontweight="bold")

for col_idx, (_, flag_row) in enumerate(narrowest.iterrows()):
    ax = axes[0, col_idx]
    img = rasterize_flag(flag_dir / f"{flag_row['code']}.svg", width=320)
    ax.imshow(img)
    ax.set_title(f"{flag_row['name']}\n({flag_row['aspect_ratio']:.2f})", fontsize=8)
    ax.axis("off")

for col_idx, (_, flag_row) in enumerate(widest.sort_values("aspect_ratio", ascending=False).iterrows()):
    ax = axes[1, col_idx]
    img = rasterize_flag(flag_dir / f"{flag_row['code']}.svg", width=320)
    ax.imshow(img)
    ax.set_title(f"{flag_row['name']}\n({flag_row['aspect_ratio']:.2f})", fontsize=8)
    ax.axis("off")

# axis("off") hides axis labels, so annotate the rows at the figure level instead
fig.text(0.02, 0.72, "Narrowest", fontsize=10, fontweight="bold",
         rotation=90, va="center")
fig.text(0.02, 0.30, "Widest", fontsize=10, fontweight="bold",
         rotation=90, va="center")
plt.tight_layout()
plt.show()

Top row: the 5 narrowest flags (closest to square or taller than wide). Bottom row: the 5 widest flags. Nepal’s double-pennant shape (aspect ratio 0.82) is a global outlier. Qatar (2.55) is the widest. Most of the widest flags follow the British 2:1 tradition.

7.4 Aspect Ratio and Symmetry

Do a flag’s proportions relate to its bilateral symmetry? Square flags (Switzerland, Vatican City) are perfectly symmetric by design. Non-standard aspect ratios might correlate with unusual flag shapes that also break left-right symmetry. This cross-family scatter tests the relationship.

Interactive cross-family scatter: aspect ratio vs symmetry score
df_cross_ar = df_aspect[["code", "name", "aspect_ratio"]].merge(
    df_geometry[["code", "symmetry_score", "horizontal_dominance"]], on="code"
)

fig = px.scatter(
    df_cross_ar, x="aspect_ratio", y="symmetry_score",
    color="horizontal_dominance",
    color_continuous_scale="RdBu",
    hover_name="name",
    hover_data={"aspect_ratio": ":.2f", "symmetry_score": ":.3f",
                "horizontal_dominance": ":.2f"},
    labels={"aspect_ratio": "Aspect Ratio (Family 5)",
            "symmetry_score": "Symmetry Score (Family 4)",
            "horizontal_dominance": "Horizontal Dominance"},
    title="Flag Shape vs Bilateral Symmetry",
    opacity=0.75, width=800, height=600,
)

fig.add_hline(y=0, line_dash="dot", line_color="gray", opacity=0.5)
fig.show()

Each dot is a flag. X-axis: aspect ratio (Family 5). Y-axis: symmetry score (Family 4). Color: horizontal dominance (Family 4). The two square flags (Switzerland, Vatican City) both sit at perfect symmetry. Nepal, the only flag with aspect ratio below 1, is also strongly asymmetric. The bulk of flags cluster in two vertical bands at 1.50 and 2.00, showing the full range of symmetry within each standard proportion.

7.5 Discussion

Family 5 adds the final dimension to our feature space.

Two ratios rule the world. The 3:2 ratio (approximately 1.50) accounts for roughly 110 flags, making it the dominant global standard. The 2:1 ratio (approximately 2.00) accounts for another 77, concentrated among Commonwealth nations and former British territories. Together these two standards cover about 75% of all sovereign flags. The remaining 25% scatter across a dozen rarer proportions.

Aspect ratio encodes colonial history. The 2:1 family is almost entirely a British inheritance. When colonies gained independence, many kept the British proportional standard even as they redesigned their colors and symbols. This makes aspect ratio one of the clearest signals of the Colonial Ghost hypothesis: a single geometric property that persists across political revolutions.

The outliers are instantly recognizable. Nepal’s double-pennant (0.82), the only non-rectangular sovereign flag, is the most extreme shape outlier in the dataset. Switzerland and Vatican City’s squares (1.00) form a separate category. Qatar’s elongated shape (2.55) is the widest. These outliers are not noise; they are deliberate design choices with deep cultural and historical significance.

Shape alone separates entire design traditions. Unlike our other metrics, which measure continuous visual properties, aspect ratio acts more like a categorical variable with a few dominant levels. This makes it particularly useful as a clustering signal: flags with the same aspect ratio share a common design heritage, and deviations from the standard ratios are strong indicators of independent design traditions.
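Since the measured values cluster around a handful of legal standards, they can be snapped to categorical levels before clustering. A small sketch, where the canonical ratio set and the tolerance are illustrative assumptions rather than values used elsewhere in this analysis:

```python
def snap_aspect_ratio(ar, standards=(0.82, 1.0, 1.5, 2.0, 2.55), tol=0.05):
    """Map a measured width/height ratio to the nearest canonical ratio,
    or None if no standard lies within `tol` (a non-standard shape)."""
    nearest = min(standards, key=lambda s: abs(s - ar))
    return nearest if abs(nearest - ar) <= tol else None

print(snap_aspect_ratio(1.4995))  # 1.5  (a 3:2 flag with rasterization noise)
print(snap_aspect_ratio(2.01))    # 2.0  (the British 2:1 family)
print(snap_aspect_ratio(1.15))    # None (Belgium: genuinely non-standard)
```

Flags that snap to the same level share a proportional tradition; a `None` result flags exactly the independent design traditions discussed above.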

With all 19 features now extracted across five families (8 + 3 + 3 + 4 + 1), we have a complete numerical fingerprint for every flag in our corpus. In the following sections, we combine these features into a unified distance matrix and explore the resulting geometry through dimensionality reduction and clustering.

8 Feature Matrix Assembly

We merge all five family DataFrames into a single feature matrix. This 250 × 19 matrix is the starting point for everything that follows: distance computation, dimensionality reduction, clustering, and hypothesis testing.

Merge all families into a single feature matrix
# ---- Merge all five family DataFrames on country code ----
# Drop achromatic_pct from palette (it is redundant: white_pct + black_pct)
df_features = (
    df_palette.drop(columns="achromatic_pct", errors="ignore")
    .merge(df_complexity.drop(columns="name"), on="code")
    .merge(df_visual.drop(columns="name"),     on="code")
    .merge(df_geometry.drop(columns="name"),   on="code")
    .merge(df_aspect.drop(columns="name"),     on="code")
)

# Sanity check
feature_cols = [c for c in df_features.columns if c not in ("code", "name")]
assert df_features.shape[0] == 250, f"Expected 250 rows, got {df_features.shape[0]}"
assert len(feature_cols) == 19, f"Expected 19 features, got {len(feature_cols)}: {feature_cols}"

# Also save to disk for reproducibility
df_features.to_csv("data/flag_features.csv", index=False)

# Create the working copy for the analysis sections
df = df_features.copy()
id_cols = ["code", "name"]

print(f"Feature matrix: {df.shape[0]} flags × {len(feature_cols)} features")
itshow(df, lengthMenu=[5, 10, 25, 50], pageLength=5)
Feature matrix: 250 flags × 19 features

With the complete feature matrix in hand, we move from extraction to analysis. The next sections compute distances between flags, project the 19-dimensional space into 2D, discover clusters of visually similar flags, and test whether those clusters reflect real-world geography, history, and economics.

9 Distance Analysis

Our 19 features live on wildly different scales. Color percentages range from 0 to 1, color_contrast (CIEDE2000) ranges from 25 to 101, palette_complexity is an integer from 2 to 8, and aspect_ratio spans 0.82 to 2.55. If we compute distances on the raw features, the high-range variables would dominate and the others would contribute almost nothing.

The standard fix is z-score normalization: subtract the mean and divide by the standard deviation of each feature. After this transformation every feature has mean 0 and standard deviation 1, so they all contribute equally to pairwise distances.

We then compute two distance matrices:

  • Euclidean distance, the straight-line distance in 19-D space, captures magnitude differences: two flags are close when all their (standardized) features are numerically similar.
  • Cosine distance measures the angle between two feature vectors, ignoring magnitude. Two flags can have very different absolute feature values but still be “close” in cosine space if their feature profiles point in the same direction.

Together, the two metrics give a richer view of similarity than either alone.
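The effect of both choices shows up in a toy example: without standardization, a wide-range feature like color_contrast swamps a 0-1 percentage in the Euclidean distance, and cosine distance treats a vector and any positive rescaling of it as identical. The numbers below are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# Raw-scale dominance: two toy flags as [red_pct, color_contrast].
a = np.array([0.40, 55.0])
b = np.array([0.45, 85.0])
# The 30-unit contrast gap dwarfs the 0.05 percentage gap:
print(round(euclidean(a, b), 4))  # 30.0 (red_pct barely registers)

# Direction vs. magnitude: a vector and a scaled copy of it.
u = np.array([1.0, 2.0, 3.0])
print(round(euclidean(u, 3 * u), 3))    # 7.483 (far apart in Euclidean terms)
print(abs(round(cosine(u, 3 * u), 6)))  # 0.0   (identical direction)
```

Z-scoring fixes the first problem; keeping both metrics side by side covers the second.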

Z-score standardization and pairwise distance matrices
from scipy.spatial.distance import pdist, squareform
from sklearn.preprocessing import StandardScaler

# ---- Z-score standardization ----
# Each feature gets mean=0, std=1 so no single feature dominates the distance.
scaler = StandardScaler()
X_raw = df[feature_cols].values            # (250, 19) raw features
X_std = scaler.fit_transform(X_raw)        # (250, 19) standardized

# ---- Pairwise distance matrices (250 x 250) ----
D_euclidean = squareform(pdist(X_std, metric="euclidean"))
D_cosine    = squareform(pdist(X_std, metric="cosine"))

# Quick sanity check
codes = df["code"].values
names = df["name"].values
n = len(df)

# Store as labeled DataFrames for easier lookup later
df_euc = pd.DataFrame(D_euclidean, index=names, columns=names)
df_cos = pd.DataFrame(D_cosine,    index=names, columns=names)

print(f"Distance matrices computed: {n} x {n}")
print(f"Euclidean -- min (non-self): {D_euclidean[D_euclidean > 0].min():.4f}, "
      f"max: {D_euclidean.max():.4f}, mean: {D_euclidean[np.triu_indices(n, k=1)].mean():.4f}")
print(f"Cosine    -- min (non-self): {D_cosine[D_cosine > 0].min():.4f}, "
      f"max: {D_cosine.max():.4f}, mean: {D_cosine[np.triu_indices(n, k=1)].mean():.4f}")
Distance matrices computed: 250 x 250
Euclidean -- min (non-self): 0.0429, max: 12.7596, mean: 5.9892
Cosine    -- min (non-self): 0.0000, max: 1.8284, mean: 1.0018

9.1 Near-duplicates

Before we do anything else, let us see which flags our features consider identical or nearly so. A pair with Euclidean distance below 0.5 (in standardized space) shares almost the same 19-dimensional profile.

Flag pairs with Euclidean distance < 0.5
# ---- Collect near-duplicate pairs ----
threshold = 0.5
pairs = []
for i in range(n):
    for j in range(i + 1, n):
        if D_euclidean[i, j] < threshold:
            pairs.append({
                "flag_a": names[i],
                "flag_b": names[j],
                "euclidean_dist": round(D_euclidean[i, j], 4),
                "cosine_dist":   round(D_cosine[i, j], 4),
            })

df_pairs = pd.DataFrame(pairs).sort_values("euclidean_dist").reset_index(drop=True)
itshow(df_pairs, lengthMenu=[5, 10, 25], pageLength=15)

The table shows 13 pairs below the threshold, and they fall into two distinct categories.

First, there are political duplicates: territories that officially fly the same flag as their parent state. France and Saint Martin, Bouvet Island and Svalbard (both Norwegian), and the United States and its Minor Outlying Islands all register at distance zero because their feature vectors are literally identical. These are trivial matches.

The interesting group is the design twins: flags from unrelated countries that converge on nearly the same visual formula. Chad and Romania (d = 0.04) are the most famous case in vexillology, both are vertical blue-yellow-red tricolors, differing only by a barely perceptible shift in the blue stripe’s hue. Netherlands and Russia (d = 0.25) share the red-white-blue horizontal layout that became a template for dozens of nations after the French Revolution, though Russia’s stripes are wider and its blue is darker. Indonesia and Poland (d = 0.43) are both minimal red-and-white bicolors, just flipped vertically. Even Australia and New Zealand (d = 0.47) appear here, reflecting their shared British-blue canton-and-stars formula.

The fact that these well-known visual similarities all surface automatically from 19 numerical features is a strong validation: the feature space is capturing what the human eye sees.

9.2 Distance heatmap

A 250 x 250 heatmap is large, but if we sort the flags by a meaningful order it reveals block structure. We use hierarchical clustering (Ward linkage) to reorder the rows and columns so that similar flags end up adjacent.

Clustered heatmap of Euclidean distances
from scipy.cluster.hierarchy import linkage, leaves_list

# ---- Hierarchical clustering for row/column ordering ----
condensed = pdist(X_std, metric="euclidean")
Z = linkage(condensed, method="ward")
order = leaves_list(Z)

# ---- Reorder the distance matrix ----
D_ordered = D_euclidean[np.ix_(order, order)]
names_ordered = names[order]

# ---- Plot with plotly for interactivity ----
fig = px.imshow(
    D_ordered,
    x=names_ordered,
    y=names_ordered,
    color_continuous_scale="Viridis",
    labels=dict(color="Euclidean Distance"),
    title="Pairwise Euclidean Distance (Ward-ordered)",
    aspect="equal",
    width=850,
    height=850,
)
fig.update_layout(
    xaxis=dict(tickfont=dict(size=5), tickangle=90),
    yaxis=dict(tickfont=dict(size=5)),
    margin=dict(l=120, r=20, t=50, b=120),
)
fig.show()

The Ward-ordered heatmap reveals clear block structure along the diagonal. At least four or five dark square blocks stand out, each one a group of flags with low internal distances, in other words, proto-clusters of similar designs. The largest dark block (roughly in the upper-left quadrant) groups the simple tricolor and bicolor flags that dominate European and African vexillology. A second block collects the blue-canton, star-heavy flags common in Oceania and the Anglosphere.

Equally telling are the bright off-diagonal rectangles: these are pairs of clusters that are maximally dissimilar. The brightest patches appear between the simple-bicolor group and the complex, multi-element flags of nations like Belize, Turkmenistan, or the Vatican, designs that score high on palette complexity and edge density where the simple flags score low.

The overall distribution of color is not uniform: there is more bright area than dark, confirming that the average pair of flags is moderately distant (mean d ~ 6.0) and that truly similar pairs are the exception, not the rule.

9.3 Nearest neighbors

For each flag, who are its 5 closest companions in the feature space? This table is the most intuitive way to interrogate the distance matrix.

k=5 nearest neighbors (Euclidean) for every flag
# ---- Build a nearest-neighbors table ----
k = 5
rows = []
for i in range(n):
    dists = D_euclidean[i].copy()
    dists[i] = np.inf  # exclude self
    nn_idx = np.argsort(dists)[:k]
    row = {"flag": names[i], "code": codes[i]}
    for rank, j in enumerate(nn_idx, start=1):
        row[f"neighbor_{rank}"] = names[j]
        row[f"dist_{rank}"]     = round(dists[j], 2)
    rows.append(row)

df_nn = pd.DataFrame(rows)
itshow(df_nn, lengthMenu=[5, 10, 25, 50], pageLength=10)

Scrolling through the table reveals some satisfying patterns. Pan-African flags cluster together: Kenya’s nearest neighbor is South Sudan, Tanzania’s is Saint Kitts and Nevis (both use diagonal black stripes over warm backgrounds). Nordic crosses find each other: Iceland’s top match is Norway, Denmark’s is Finland. The Pan-Arab tricolors (Egypt, Yemen, Iraq, Syria) form a tight neighborhood. And flags with the Union Jack canton, Australia, New Zealand, Fiji, Tuvalu, consistently appear in each other’s top 5.

There are also a few surprises. Japan’s nearest neighbor is Cyprus (d ~ 1.9), which at first seems odd until you realize both are minimal flags with a single centered emblem on a white or near-white field, giving them similar low entropy, low edge density, and high symmetry. Hong Kong and Tunisia are neighbors (d ~ 0.38) because both are single-emblem-on-solid-background designs with similar color balances.

These neighborhood relationships confirm that the distance metric is encoding design grammar, the structural vocabulary of how a flag is composed, rather than superficial color coincidence.

9.4 Most and least similar pairs

Let us visualize the extremes: the 10 most similar and 10 most dissimilar pairs, shown side by side with their flags.

10 most similar and 10 most dissimilar flag pairs
# ---- Collect all unique pairs with distances ----
triu_i, triu_j = np.triu_indices(n, k=1)
all_pairs = pd.DataFrame({
    "flag_a": names[triu_i],
    "code_a": codes[triu_i],
    "flag_b": names[triu_j],
    "code_b": codes[triu_j],
    "euclidean": D_euclidean[triu_i, triu_j],
    "cosine":    D_cosine[triu_i, triu_j],
})

most_similar    = all_pairs.nsmallest(10, "euclidean").reset_index(drop=True)
most_dissimilar = all_pairs.nlargest(10, "euclidean").reset_index(drop=True)
Visual comparison: 10 most similar pairs
fig, axes = plt.subplots(10, 2, figsize=(8, 18))
fig.suptitle("10 Most Similar Flag Pairs (Euclidean)", fontsize=14, y=1.01)

for row_idx, (_, pair) in enumerate(most_similar.iterrows()):
    for col, code_col, name_col in [(0, "code_a", "flag_a"), (1, "code_b", "flag_b")]:
        svg = flag_dir / f"{pair[code_col]}.svg"
        img = rasterize_flag(svg, width=320)
        axes[row_idx, col].imshow(img)
        axes[row_idx, col].set_title(pair[name_col], fontsize=8)
        axes[row_idx, col].axis("off")
    # Distance label between the pair
    axes[row_idx, 0].annotate(
        f"d = {pair['euclidean']:.2f}",
        xy=(1.05, 0.5), xycoords="axes fraction",
        fontsize=7, ha="left", va="center", color="gray",
    )

plt.tight_layout()
plt.show()

The most similar pairs confirm the near-duplicates table with visual proof: the political duplicates (France/Saint Martin, Bouvet/Norway, US/US Minor Outlying Islands) are pixel-for-pixel identical. Among the non-trivial pairs, Chad and Romania look almost indistinguishable at thumbnail scale; you have to zoom in to notice the slightly more indigo blue of Chad’s left stripe. Netherlands and Russia show the same red-white-blue stack with only a tone difference. Australia and New Zealand share the dark blue field with the Union Jack canton and a Southern Cross constellation.

Now let us look at the opposite end of the spectrum:

Visual comparison: 10 most dissimilar pairs
fig, axes = plt.subplots(10, 2, figsize=(8, 18))
fig.suptitle("10 Most Dissimilar Flag Pairs (Euclidean)", fontsize=14, y=1.01)

for row_idx, (_, pair) in enumerate(most_dissimilar.iterrows()):
    for col, code_col, name_col in [(0, "code_a", "flag_a"), (1, "code_b", "flag_b")]:
        svg = flag_dir / f"{pair[code_col]}.svg"
        img = rasterize_flag(svg, width=320)
        axes[row_idx, col].imshow(img)
        axes[row_idx, col].set_title(pair[name_col], fontsize=8)
        axes[row_idx, col].axis("off")
    axes[row_idx, 0].annotate(
        f"d = {pair['euclidean']:.2f}",
        xy=(1.05, 0.5), xycoords="axes fraction",
        fontsize=7, ha="left", va="center", color="gray",
    )

plt.tight_layout()
plt.show()

The most dissimilar pairs pit the extremes of flag design against each other. A recurring pattern: one flag is chromatically minimal (white or single-hue background, simple geometry, high symmetry) while the other is chromatically dense (many colors, complex emblems, high edge density). Nepal appears repeatedly on the “dissimilar” side: its unique double-pennant shape gives it an outlier aspect ratio (~0.82 vs. the near-universal 1.5-2.0) and unusual geometric features, making it distant from virtually every rectangular flag in the dataset.

9.5 Euclidean vs. Cosine agreement

We computed two distance metrics. Do they agree on which flags are similar? If they do, the distance structure is robust and not an artifact of our metric choice. If they diverge for certain pairs, those cases are worth investigating: they would reveal flags that are similar in profile shape (feature direction) but not in degree (feature magnitudes), or vice versa.

Scatter: Euclidean vs. Cosine distance for all pairs
# ---- No subsampling needed: all 31,125 pairs render fine in plotly ----
fig = px.scatter(
    all_pairs,
    x="euclidean",
    y="cosine",
    opacity=0.15,
    title="Euclidean vs. Cosine Distance (all 31,125 pairs)",
    labels={"euclidean": "Euclidean Distance (standardized)", "cosine": "Cosine Distance"},
    width=750,
    height=550,
)
fig.update_traces(marker=dict(size=3))
fig.show()
Correlation between the two distance metrics
from scipy.stats import pearsonr, spearmanr

r_pearson, _  = pearsonr(all_pairs["euclidean"], all_pairs["cosine"])
r_spearman, _ = spearmanr(all_pairs["euclidean"], all_pairs["cosine"])

print(f"Pearson  r = {r_pearson:.4f}")
print(f"Spearman ρ = {r_spearman:.4f}")
Pearson  r = 0.7440
Spearman ρ = 0.7224

The scatter plot shows a strong, monotonically increasing relationship with some spread at intermediate distances. The Pearson and Spearman correlations are both strong (r = 0.74, ρ = 0.72), confirming that the two metrics largely agree: pairs that Euclidean considers most similar are also the ones Cosine ranks highest.

The spread at mid-range distances is worth noting. Some pairs sit above the main trend (higher cosine distance than their Euclidean distance would predict), meaning they differ more in feature direction than in feature magnitude. These tend to be flags with similar overall complexity but different color palettes, for instance, a red-heavy flag vs. a blue-heavy flag with otherwise similar structure. Conversely, pairs below the trend have similar feature profiles pointing in the same direction but at different scales, flags that share the same design template but differ in how strongly each feature is expressed.

Overall, the strong agreement between the two metrics confirms that the similarity structure is a genuine property of the feature space, not an artifact of how we measure distance. For the remainder of the analysis we will primarily use Euclidean distance, knowing that Cosine would yield broadly the same conclusions.

10 Deep Learning Embeddings

Our 19 hand-crafted features encode what we think matters about a flag: color proportions, palette complexity, edge density, symmetry. But are we missing something? A convolutional neural network trained on millions of natural images has learned to detect textures, spatial arrangements, and compositional patterns that no hand-crafted feature set can fully capture.

To test this, we pass each flag through ResNet-50 (pretrained on ImageNet), remove the final classification layer, and extract the 2048-dimensional embedding from the global average pooling layer. This vector is a learned representation of the flag’s visual content. We then build a distance matrix from these embeddings and compare it against our artisanal distance matrix.

The question is not “which is better?”; each space encodes different information. The question is: how much do they agree, and where do they disagree?
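Agreement between two distance matrices can be quantified with a Mantel-style statistic: correlate their upper triangles. A minimal sketch on synthetic data; the real comparison would pass the artisanal and deep matrices computed in this section:

```python
import numpy as np
from scipy.stats import spearmanr

def matrix_agreement(D1, D2):
    """Spearman correlation between the upper triangles of two symmetric
    distance matrices (a Mantel-style agreement statistic)."""
    iu = np.triu_indices_from(D1, k=1)
    rho, _ = spearmanr(D1[iu], D2[iu])
    return rho

# Sanity check: a matrix is rank-identical to any monotone transform of itself.
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 4))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
print(round(matrix_agreement(D, D ** 2), 3))  # 1.0
```

A full Mantel test would add a permutation step for significance; here the correlation itself is the quantity of interest.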

Load ResNet-50 backbone and define preprocessing
import torch
import torch.nn as nn
from torchvision import models, transforms

# ---- Load pretrained ResNet-50 ----
# We use the V2 weights (ImageNet-1K, 80.9% top-1 accuracy).
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.eval()

# ---- Remove the final classification layer ----
# This leaves us with the 2048-dim output of the global average pooling layer:
# a rich, general-purpose visual embedding.
backbone = nn.Sequential(*list(resnet.children())[:-1])

# ---- ImageNet-standard preprocessing ----
# Resize to 256, center-crop to 224, normalize to ImageNet channel means/stds.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

print("ResNet-50 backbone loaded (2048-dim embeddings)")
ResNet-50 backbone loaded (2048-dim embeddings)
Extract 2048-dim embeddings for all 250 flags
import time

# ---- Forward pass for every flag ----
embeddings = []
t0 = time.time()

for _, row in df.iterrows():
    # Rasterize SVG to PIL image
    svg_path = flag_dir / f"{row['code']}.svg"
    png_data = cairosvg.svg2png(url=str(svg_path), output_width=320)
    img = Image.open(io.BytesIO(png_data)).convert("RGB")

    # Preprocess and extract embedding
    tensor = preprocess(img).unsqueeze(0)          # (1, 3, 224, 224)
    with torch.no_grad():
        emb = backbone(tensor).squeeze().numpy()   # (2048,)
    embeddings.append(emb)

X_deep = np.stack(embeddings)                      # (250, 2048)
elapsed = time.time() - t0

print(f"Extracted {X_deep.shape[0]} embeddings of dimension {X_deep.shape[1]} in {elapsed:.1f}s")
print(f"Any NaN: {np.isnan(X_deep).any()}")
Extracted 250 embeddings of dimension 2048 in 57.8s
Any NaN: False

Each flag is now represented twice: as a 19-dimensional hand-crafted vector and as a 2048-dimensional learned vector. The hand-crafted features are interpretable (we know exactly what each dimension measures), while the deep features are opaque but potentially richer.

Cosine distance matrix from ResNet embeddings
from scipy.spatial.distance import pdist, squareform

# ---- Pairwise cosine distance in embedding space ----
# Cosine is the standard metric for neural embeddings because the magnitude
# of the activation vector is less meaningful than its direction.
D_deep = squareform(pdist(X_deep, metric="cosine"))

print(f"Deep distance matrix: {D_deep.shape}")
print(f"  min (non-self): {D_deep[D_deep > 0].min():.6f}")
print(f"  max:            {D_deep.max():.6f}")
print(f"  mean:           {D_deep[np.triu_indices(n, k=1)].mean():.6f}")
Deep distance matrix: (250, 250)
  min (non-self): 0.000000
  max:            0.956398
  mean:           0.494939
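As a quick aside (a toy check, not part of the pipeline), SciPy's "cosine" metric returns 1 − cos(u, v), so it depends only on a vector's direction, never its magnitude: scaled copies of a vector sit at distance 0, orthogonal vectors at distance 1.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Toy sanity check (illustrative vectors, not flag embeddings):
# "cosine" distance = 1 - cos(u, v), so magnitude is ignored entirely.
u = np.array([1.0, 0.0, 0.0])
v = np.array([3.0, 0.0, 0.0])   # same direction, 3x the magnitude
w = np.array([0.0, 2.0, 0.0])   # orthogonal direction

d = pdist(np.stack([u, v, w]), metric="cosine")
print(d)   # [d(u,v), d(u,w), d(v,w)] -> [0.0, 1.0, 1.0]
```

This is why cosine suits ResNet activations: two flags that excite the same channels in the same proportions count as stylistically identical, even if one activates more strongly overall.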

10.1 Deep nearest neighbors

Let us see who ResNet considers each flag’s closest neighbors. The results are revealing: the deep model picks up on spatial layout and texture patterns that our hand-crafted features were not designed to capture.

k=5 nearest neighbors in ResNet embedding space
# ---- Build nearest-neighbor table from deep distances ----
k = 5
rows_deep = []
for i in range(n):
    dists = D_deep[i].copy()
    dists[i] = np.inf
    nn_idx = np.argsort(dists)[:k]
    row = {"flag": names[i], "code": codes[i]}
    for rank, j in enumerate(nn_idx, start=1):
        row[f"neighbor_{rank}"] = names[j]
        row[f"dist_{rank}"]     = round(dists[j], 4)
    rows_deep.append(row)

df_nn_deep = pd.DataFrame(rows_deep)
itshow(df_nn_deep, lengthMenu=[5, 10, 25, 50], pageLength=10)

The deep model’s neighborhoods reveal a different kind of intelligence. The United Kingdom’s nearest neighbors are Turks and Caicos, Fiji, Bermuda, and Montserrat, all of which fly blue ensigns with the Union Jack in the canton (three are British Overseas Territories; Fiji is independent but retains the ensign design). Our artisanal features had matched the UK to Kiribati and Malaysia (similar color proportions), but ResNet actually sees the Union Jack pattern embedded in the corner and groups these flags together. That is a spatial relationship our hand-crafted features, which are all global summaries, cannot detect.

Similarly, Japan’s deep neighbors are Greenland, Bangladesh, and Palau, all flags with a single circular emblem on a plain field. The artisanal space had matched Japan to Cyprus (also a single emblem on white), but the deep model goes further and finds the circular disk pattern specifically, regardless of background color.

Germany’s deep neighbors are Indonesia, Austria, and Latvia, all horizontal stripe flags. France’s are Saint Martin, Italy, Peru, and Ivory Coast, all vertical tricolors. The deep model is reading the stripe orientation and spatial layout more precisely than our Hough Transform features, which only measure dominance ratios.

10.2 Comparing the two spaces

Now the key question: how much do the artisanal and deep distance matrices agree?

Artisanal vs. deep: Spearman correlation, neighbor overlap, Procrustes
from scipy.stats import spearmanr
from scipy.spatial import procrustes
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# ---- Artisanal distance matrix (Euclidean, standardized) ----
X_std = StandardScaler().fit_transform(df[feature_cols].values)
D_art = squareform(pdist(X_std, metric="euclidean"))

# ---- 1. Spearman rank correlation between all 31,125 pairwise distances ----
triu = np.triu_indices(n, k=1)
rho_dist, p_dist = spearmanr(D_art[triu], D_deep[triu])

# ---- 2. Neighbor overlap: fraction of shared neighbors at k=5 and k=10 ----
results = {}
for k in [5, 10]:
    overlaps = []
    for i in range(n):
        art_nn  = set(np.argsort(D_art[i])[1:k+1])
        deep_nn = set(np.argsort(D_deep[i])[1:k+1])
        overlaps.append(len(art_nn & deep_nn) / k)
    results[k] = overlaps

# ---- 3. Procrustes disparity (in 10-D PCA space) ----
# Reduce both embedding spaces to 10 dimensions, then measure how well
# one can be rotated/scaled to match the other.
pca_art  = PCA(n_components=10).fit_transform(X_std)
pca_deep = PCA(n_components=10).fit_transform(X_deep)
_, _, disparity = procrustes(pca_art, pca_deep)

print("=== Artisanal vs. Deep Space Comparison ===\n")
print(f"Spearman ρ (31,125 pairwise distances): {rho_dist:.4f}  (p ≈ {p_dist:.1e})")
print(f"\nNeighbor overlap at k=5:  mean = {np.mean(results[5]):.3f}, "
      f"median = {np.median(results[5]):.3f}")
print(f"Neighbor overlap at k=10: mean = {np.mean(results[10]):.3f}, "
      f"median = {np.median(results[10]):.3f}")
print(f"\nProcrustes disparity (10-D): {disparity:.4f}")
print(f"  (0 = identical geometry, 1 = unrelated)")
=== Artisanal vs. Deep Space Comparison ===

Spearman ρ (31,125 pairwise distances): 0.3681  (p ≈ 0.0e+00)

Neighbor overlap at k=5:  mean = 0.201, median = 0.200
Neighbor overlap at k=10: mean = 0.231, median = 0.200

Procrustes disparity (10-D): 0.7227
  (0 = identical geometry, 1 = unrelated)

The numbers paint a clear picture. A Spearman correlation of ~0.37 is statistically significant (p ≈ 0) but moderate: the two spaces agree on the broad strokes (flags that are very similar in one space tend to be somewhat similar in the other) but disagree substantially on the details. The neighbor overlap of ~20% at k=5 means that, on average, only 1 of every 5 nearest neighbors is shared between the two representations. And the Procrustes disparity of ~0.72 (where 0 is identical geometry and 1 is unrelated) confirms that the global structure of the two spaces is quite different.

This is exactly the outcome that makes both representations valuable: they are not redundant; they see different things.
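To calibrate that 20% overlap figure: if the two spaces picked their k nearest neighbors independently at random, the expected overlap would be only k/(n − 1). A small sketch (an aside, not part of the original analysis):

```python
import numpy as np

# Chance baseline for neighbor overlap (an aside, not in the pipeline):
# two independently random k-neighbor sets drawn from n-1 candidates
# share, in expectation, a fraction k / (n - 1) of their members.
n, k = 250, 5
chance = k / (n - 1)
print(f"Chance-level overlap at k={k}: {chance:.3f}")   # ~0.020

# Monte Carlo confirmation of that expectation
rng = np.random.default_rng(42)
sims = [
    len(set(rng.choice(n - 1, k, replace=False)) &
        set(rng.choice(n - 1, k, replace=False))) / k
    for _ in range(20_000)
]
print(f"Simulated mean overlap:       {np.mean(sims):.3f}")
```

Against this ~2% baseline, the observed ~20% overlap is roughly ten times chance: the spaces clearly share structure, just not neighbor-for-neighbor.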

Scatter: artisanal vs. deep pairwise distances
# ---- Scatter plot of artisanal vs deep distances ----
scatter_df = pd.DataFrame({
    "artisanal_euclidean": D_art[triu],
    "deep_cosine": D_deep[triu],
})

fig = px.scatter(
    scatter_df,
    x="artisanal_euclidean",
    y="deep_cosine",
    opacity=0.1,
    title=f"Artisanal vs. Deep Pairwise Distances (ρ = {rho_dist:.3f})",
    labels={
        "artisanal_euclidean": "Artisanal Distance (Euclidean, z-scored)",
        "deep_cosine": "Deep Distance (ResNet-50 cosine)",
    },
    width=750,
    height=550,
)
fig.update_traces(marker=dict(size=3))
fig.show()

The scatter shows a positive trend but with enormous spread. Pairs in the lower-left corner are similar in both spaces: these are the easy cases, like Chad/Romania or France/Saint Martin, where the flags are so alike that any representation picks them up. Pairs in the upper-right are dissimilar in both: flags with nothing in common by any measure.

The interesting cases are the off-diagonal ones. Pairs in the upper-left (low artisanal distance, high deep distance) are flags that share similar color statistics but look spatially different: a horizontal red-white-blue tricolor and a vertical one might have identical color percentages yet very different spatial structure that ResNet detects. Pairs in the lower-right (high artisanal distance, low deep distance) are flags that look spatially similar to ResNet but differ in the numeric features: for example, two flags with the same layout template rendered in completely different color palettes.

10.3 Where the models disagree

Let us find the most interesting disagreements, pairs where the two spaces give contradictory rankings.

Pairs with largest rank disagreement between spaces
# ---- Rank each pair in both spaces ----
all_pairs_compare = pd.DataFrame({
    "flag_a": names[triu[0]],
    "code_a": codes[triu[0]],
    "flag_b": names[triu[1]],
    "code_b": codes[triu[1]],
    "d_artisanal": D_art[triu],
    "d_deep":      D_deep[triu],
})

all_pairs_compare["rank_art"]  = all_pairs_compare["d_artisanal"].rank()
all_pairs_compare["rank_deep"] = all_pairs_compare["d_deep"].rank()
all_pairs_compare["rank_diff"] = all_pairs_compare["rank_art"] - all_pairs_compare["rank_deep"]

# Artisanal says "similar" but Deep says "different" (large negative rank_diff)
art_close_deep_far = all_pairs_compare.nsmallest(10, "rank_diff")[
    ["flag_a", "flag_b", "d_artisanal", "d_deep", "rank_diff"]
].reset_index(drop=True)

# Deep says "similar" but Artisanal says "different" (large positive rank_diff)
deep_close_art_far = all_pairs_compare.nlargest(10, "rank_diff")[
    ["flag_a", "flag_b", "d_artisanal", "d_deep", "rank_diff"]
].reset_index(drop=True)

print("=== Artisanal says SIMILAR, Deep says DIFFERENT ===")
itshow(art_close_deep_far, lengthMenu=[10], pageLength=10)
=== Artisanal says SIMILAR, Deep says DIFFERENT ===
Pairs where Deep says similar but Artisanal says different
print("=== Deep says SIMILAR, Artisanal says DIFFERENT ===")
itshow(deep_close_art_far, lengthMenu=[10], pageLength=10)
=== Deep says SIMILAR, Artisanal says DIFFERENT ===
Visual comparison of the top disagreements
fig, axes = plt.subplots(5, 2, figsize=(10, 12))
fig.suptitle("Artisanal says SIMILAR, Deep says DIFFERENT",
             fontsize=13, fontweight="bold", y=1.02)

for row_idx in range(5):
    pair = art_close_deep_far.iloc[row_idx]
    for col, name_col in [(0, "flag_a"), (1, "flag_b")]:
        code_val = df.loc[df["name"] == pair[name_col], "code"].values[0]
        svg = flag_dir / f"{code_val}.svg"
        img = rasterize_flag(svg, width=320)
        axes[row_idx, col].imshow(img)
        axes[row_idx, col].set_title(pair[name_col], fontsize=8)
        axes[row_idx, col].axis("off")
    axes[row_idx, 0].annotate(
        f"Art d = {pair['d_artisanal']:.2f}\nDeep d = {pair['d_deep']:.2f}",
        xy=(1.05, 0.5), xycoords="axes fraction",
        fontsize=7, ha="left", va="center", color="gray",
    )

plt.tight_layout()
plt.show()

Pairs where Deep says similar but Artisanal says different
fig, axes = plt.subplots(5, 2, figsize=(10, 12))
fig.suptitle("Deep says SIMILAR, Artisanal says DIFFERENT",
             fontsize=13, fontweight="bold", y=1.02)

for row_idx in range(5):
    pair = deep_close_art_far.iloc[row_idx]
    for col, name_col in [(0, "flag_a"), (1, "flag_b")]:
        code_val = df.loc[df["name"] == pair[name_col], "code"].values[0]
        svg = flag_dir / f"{code_val}.svg"
        img = rasterize_flag(svg, width=320)
        axes[row_idx, col].imshow(img)
        axes[row_idx, col].set_title(pair[name_col], fontsize=8)
        axes[row_idx, col].axis("off")
    axes[row_idx, 0].annotate(
        f"Art d = {pair['d_artisanal']:.2f}\nDeep d = {pair['d_deep']:.2f}",
        xy=(1.05, 0.5), xycoords="axes fraction",
        fontsize=7, ha="left", va="center", color="gray",
    )

plt.tight_layout()
plt.show()

The disagreement grids make the complementarity vivid. In the first grid (artisanal-close, deep-far), pairs share similar color statistics (the same proportions of red, blue, and white) but very different spatial layouts: one might be a horizontal tricolor while the other carries a diagonal stripe or a complex emblem. Our 19 features, which are all global averages, cannot distinguish these layouts, but ResNet’s convolutional layers can.

In the second grid (deep-close, artisanal-far), pairs look spatially similar to the neural network (similar layouts, similar placement of elements) but differ in color: a predominantly red flag and a predominantly green flag with the same stripe arrangement would end up here. ResNet’s learned features partially abstract away color (especially in the deeper layers), focusing instead on edges, textures, and composition.

This confirms the value of our dual approach: the artisanal features capture what colors are present and how they relate, while the deep features capture how the flag is spatially organized. Neither alone tells the whole story.

11 Dimensionality Reduction and Clustering

We now have two complementary distance matrices: one from the 19 hand-crafted features (Euclidean on z-scored values) and one from the 2048-dimensional ResNet-50 embeddings (cosine). Each captures a different facet of flag similarity. Rather than choosing one, we fuse them: normalize both to [0, 1] and take a 50/50 average. The result is a single distance matrix that benefits from the color-and-ratio sensitivity of the artisanal features and the spatial-layout intelligence of the deep model.

We then use UMAP (Uniform Manifold Approximation and Projection) to compress this fused 250×250 distance matrix into 2 dimensions for visualization, and HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to discover natural groupings in the UMAP embedding. HDBSCAN is a good fit here because it does not require us to specify the number of clusters in advance, and it can label outlier flags as “noise” rather than forcing every flag into a group.

Fuse artisanal and deep distance matrices
# ---- Normalize both distance matrices to [0, 1] ----
# This ensures that neither matrix dominates the average simply due to scale.
D_art_norm  = D_art / D_art.max()
D_deep_norm = D_deep / D_deep.max()

# ---- 50/50 average ----
D_fused = 0.5 * D_art_norm + 0.5 * D_deep_norm

print(f"Fused distance matrix: {D_fused.shape}")
print(f"  Range: [{D_fused[D_fused > 0].min():.4f}, {D_fused.max():.4f}]")
print(f"  Mean:  {D_fused[np.triu_indices(n, k=1)].mean():.4f}")
Fused distance matrix: (250, 250)
  Range: [0.0000, 0.8956]
  Mean:  0.4934

11.1 UMAP projections

We run UMAP three times, once for each distance matrix, to see how each representation organizes the 250 flags in 2D. The fused map should inherit the best of both worlds.

UMAP 2D projections from artisanal, deep, and fused distances
import umap

# ---- UMAP from precomputed distance matrices ----
umap_results = {}
for label, D in [("Artisanal", D_art), ("Deep", D_deep), ("Fused", D_fused)]:
    reducer = umap.UMAP(
        n_neighbors=15,
        min_dist=0.1,
        metric="precomputed",
        random_state=42,
    )
    emb_2d = reducer.fit_transform(D)
    umap_results[label] = emb_2d
    print(f"UMAP {label}: done")

# We will use the fused embedding going forward
umap_fused = umap_results["Fused"]
UMAP Artisanal: done
UMAP Deep: done
UMAP Fused: done
Side-by-side UMAP maps: artisanal, deep, and fused
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=["Artisanal Features", "ResNet-50 Embeddings", "Fused (50/50)"],
    horizontal_spacing=0.05,
)

for idx, (label, emb) in enumerate(umap_results.items(), start=1):
    fig.add_trace(
        go.Scatter(
            x=emb[:, 0], y=emb[:, 1],
            mode="markers+text",
            text=codes,
            textposition="top center",
            textfont=dict(size=6),
            marker=dict(size=5, opacity=0.7),
            hovertext=[f"{n} ({c})" for n, c in zip(names, codes)],
            hoverinfo="text",
            showlegend=False,
        ),
        row=1, col=idx,
    )

fig.update_layout(
    title="UMAP Projections: Three Views of Flag Space",
    width=1100, height=450,
    margin=dict(t=60, b=30),
)
for i in range(1, 4):
    fig.update_xaxes(showticklabels=False, row=1, col=i)
    fig.update_yaxes(showticklabels=False, row=1, col=i)

fig.show()

The three maps tell a story about what each representation values. The artisanal map organizes flags primarily by color palette: red-dominant flags cluster on one side, blue-dominant flags on another, with multi-color complex flags forming a separate peninsula. The deep map groups by spatial layout: horizontal stripes, vertical tricolors, canton-based ensigns, and single-emblem-on-field designs each carve out their own regions. The fused map inherits both organizing principles: flags that share both color and layout end up tightly clustered, while flags that match on only one dimension sit at intermediate distances.

11.2 HDBSCAN clustering

HDBSCAN clustering on the fused UMAP embedding
import hdbscan

# ---- Cluster the fused UMAP embedding ----
clusterer = hdbscan.HDBSCAN(min_cluster_size=8, min_samples=3)
cluster_labels = clusterer.fit_predict(umap_fused)

n_clusters = len(set(cluster_labels) - {-1})
n_noise = (cluster_labels == -1).sum()

# ---- Attach cluster labels to the DataFrame ----
df["cluster"] = cluster_labels
df["umap_x"] = umap_fused[:, 0]
df["umap_y"] = umap_fused[:, 1]

print(f"HDBSCAN found {n_clusters} clusters and {n_noise} noise points")
print(f"\nCluster sizes:")
for c in sorted(set(cluster_labels)):
    mask = cluster_labels == c
    label_str = f"Cluster {c}" if c >= 0 else "Noise"
    members = names[mask]
    print(f"  {label_str:12s} ({mask.sum():3d} flags): {', '.join(members[:6])}...")
HDBSCAN found 13 clusters and 35 noise points

Cluster sizes:
  Noise        ( 35 flags): Afghanistan, Algeria, Antarctica, Azerbaijan, Bhutan, Botswana...
  Cluster 0    ( 20 flags): Anguilla, Australia, Bermuda, British Indian Ocean Territory, British Virgin Islands, Cayman Islands...
  Cluster 1    ( 10 flags): Aruba, Bosnia and Herzegovina, Cape Verde, Curaçao, Kosovo, Micronesia...
  Cluster 2    ( 22 flags): Angola, Antigua and Barbuda, Bahamas, Egypt, Guadeloupe, Iraq...
  Cluster 3    ( 17 flags): American Samoa, Caribbean Netherlands, Comoros, Cuba, Czechia, Eritrea...
  Cluster 4    ( 11 flags): Bangladesh, Brazil, Christmas Island, Cocos (Keeling) Islands, Kazakhstan, Macau...
  Cluster 5    (  8 flags): Armenia, Colombia, Gabon, Lithuania, Mauritius, Rwanda...
  Cluster 6    ( 21 flags): Bolivia, Brunei, Burkina Faso, Burundi, Dominica, Ecuador...
  Cluster 7    ( 11 flags): Albania, China, Hong Kong, Isle of Man, Kyrgyzstan, Montenegro...
  Cluster 8    ( 24 flags): Andorra, Barbados, Belarus, Belgium, Benin, Cameroon...
  Cluster 9    ( 10 flags): Bahrain, Denmark, Qatar, Samoa, Switzerland, Taiwan...
  Cluster 10   ( 11 flags): Bouvet Island, Canada, Cyprus, Faroe Islands, Georgia, Guernsey...
  Cluster 11   ( 31 flags): Austria, Bulgaria, Cambodia, Costa Rica, Gambia, Greenland...
  Cluster 12   ( 19 flags): Argentina, Belize, Croatia, Dominican Republic, El Salvador, Finland...
Interactive UMAP map colored by cluster
# ---- Color by cluster, noise in gray ----
df["cluster_label"] = df["cluster"].apply(
    lambda c: f"Cluster {c}" if c >= 0 else "Noise"
)

fig = px.scatter(
    df,
    x="umap_x",
    y="umap_y",
    color="cluster_label",
    hover_name="name",
    hover_data={"code": True, "umap_x": False, "umap_y": False, "cluster_label": False},
    text="code",
    title="Flag Clusters (HDBSCAN on Fused UMAP)",
    labels={"umap_x": "", "umap_y": ""},
    width=850,
    height=650,
)
fig.update_traces(
    textposition="top center",
    textfont=dict(size=6),
    marker=dict(size=7),
)
# ---- Make noise points dark gray so they don't compete with cluster colors ----
for trace in fig.data:
    if trace.name == "Noise":
        trace.marker.color = "rgba(80, 80, 80, 0.5)"
        trace.marker.size = 5
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.update_layout(legend_title_text="Cluster", margin=dict(t=50, b=30))
fig.show()

11.3 Cluster portraits

What does each cluster look like? For every cluster we show a grid of its member flags together with a summary of the feature profiles that define it.

Mean feature profile per cluster
# ---- Compute mean feature values per cluster (excluding noise) ----
cluster_profiles = (
    df[df["cluster"] >= 0]
    .groupby("cluster")[feature_cols]
    .mean()
    .round(3)
)

itshow(cluster_profiles.T, lengthMenu=[10, 19], pageLength=19)
Flag grids for each cluster
clusters_sorted = sorted([c for c in set(cluster_labels) if c >= 0])

for c in clusters_sorted:
    mask = cluster_labels == c
    member_codes = codes[mask]
    member_names = names[mask]
    n_members = len(member_codes)

    # Grid layout: up to 6 columns
    ncols = min(6, n_members)
    nrows = int(np.ceil(n_members / ncols))

    fig, axes = plt.subplots(nrows, ncols, figsize=(ncols * 2.2, nrows * 1.5))
    if nrows == 1:
        axes = np.array(axes).reshape(1, -1)
    fig.suptitle(f"Cluster {c}  ({n_members} flags)", fontsize=12, y=1.02)

    for idx in range(nrows * ncols):
        r, col_idx = divmod(idx, ncols)
        if idx < n_members:
            svg = flag_dir / f"{member_codes[idx]}.svg"
            img = rasterize_flag(svg, width=240)
            axes[r, col_idx].imshow(img)
            axes[r, col_idx].set_title(member_names[idx], fontsize=6)
        axes[r, col_idx].axis("off")

    plt.tight_layout()
    plt.show()

The cluster grids make the organizing logic visible at a glance. Each cluster coheres around a shared design template: horizontal tricolors with similar hue sequences, vertical tricolors, canton-based blue ensigns, diagonal multicolor designs, single-emblem-on-solid-field arrangements, and so on. The noise points are flags that do not fit neatly into any group; these tend to be the most distinctive designs in the dataset, like Nepal’s double pennant, Bhutan’s dragon, or the Vatican’s papal keys.

11.4 Cluster stability

How robust are these clusters? Would small changes in the data or the UMAP parameters break them apart? We test this by re-running the full pipeline (UMAP + HDBSCAN) 20 times with different random seeds and measuring how consistently each pair of flags ends up assigned to the same cluster.

Cluster co-assignment stability across 20 random seeds
# ---- Run UMAP+HDBSCAN 20 times with different seeds ----
n_runs = 20
co_assignment = np.zeros((n, n))

for seed in range(n_runs):
    reducer = umap.UMAP(
        n_neighbors=15, min_dist=0.1,
        metric="precomputed", random_state=seed,
    )
    emb = reducer.fit_transform(D_fused)
    labels = hdbscan.HDBSCAN(min_cluster_size=8, min_samples=3).fit_predict(emb)

    # For each pair, record if they were in the same (non-noise) cluster
    for c in set(labels):
        if c < 0:
            continue
        members = np.where(labels == c)[0]
        for i in members:
            for j in members:
                co_assignment[i, j] += 1

co_assignment /= n_runs  # normalize to [0, 1]

# ---- Summary statistics ----
triu_vals = co_assignment[np.triu_indices(n, k=1)]
print(f"Co-assignment matrix: {co_assignment.shape}")
print(f"  Mean co-assignment probability: {triu_vals.mean():.3f}")
print(f"  Pairs always together (p=1.0):  {(triu_vals == 1.0).sum()}")
print(f"  Pairs never together (p=0.0):   {(triu_vals == 0.0).sum()}")

# ---- Average stability per flag ----
stability_per_flag = []
for i in range(n):
    # Mean co-assignment with flags in the same primary cluster
    primary_cluster = cluster_labels[i]
    if primary_cluster >= 0:
        same = np.where(cluster_labels == primary_cluster)[0]
        same = same[same != i]
        if len(same) > 0:
            stability_per_flag.append(co_assignment[i, same].mean())
        else:
            stability_per_flag.append(0.0)
    else:
        stability_per_flag.append(0.0)

df["stability"] = stability_per_flag
mean_stab = np.mean([s for s in stability_per_flag if s > 0])
print(f"  Mean within-cluster stability: {mean_stab:.3f}")
Co-assignment matrix: (250, 250)
  Mean co-assignment probability: 0.140
  Pairs always together (p=1.0):  336
  Pairs never together (p=0.0):   4594
  Mean within-cluster stability: 0.678
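The per-run update above loops over cluster members in nested Python loops; the same increment can be written as a single vectorized comparison. A toy sketch (the `labels` array here is illustrative, not the actual HDBSCAN output):

```python
import numpy as np

# Vectorized equivalent of one run's co-assignment update (a sketch;
# `labels` is a toy clustering, with -1 marking a noise point).
labels = np.array([0, 0, 1, -1, 1])

# True where flags i and j share the same non-noise cluster label.
same = (labels[:, None] == labels[None, :]) & (labels[:, None] >= 0)
co = same.astype(float)

print(co[0, 1], co[2, 4], co[3, 3])   # 1.0 1.0 0.0
```

This matches the loop’s semantics, including leaving noise points at zero even on the diagonal, while replacing the per-cluster double loops with one (n, n) boolean operation.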
Co-assignment heatmap (ordered by cluster)
# ---- Reorder by cluster for visualization ----
order_stab = np.argsort(cluster_labels)
co_ordered = co_assignment[np.ix_(order_stab, order_stab)]
names_stab_ordered = names[order_stab]

fig = px.imshow(
    co_ordered,
    x=names_stab_ordered,
    y=names_stab_ordered,
    color_continuous_scale="Blues",
    labels=dict(color="Co-assignment Prob."),
    title="Cluster Co-assignment Stability (20 UMAP seeds)",
    aspect="equal",
    width=850,
    height=850,
)
fig.update_layout(
    xaxis=dict(tickfont=dict(size=5), tickangle=90),
    yaxis=dict(tickfont=dict(size=5)),
    margin=dict(l=120, r=20, t=50, b=120),
)
fig.show()

The co-assignment heatmap reveals which clusters are rock-solid and which are more fluid. The darkest diagonal blocks (pairs that stay together in all 20 runs) represent the most natural, unambiguous groupings in flag space. Lighter blocks at the boundaries indicate flags that sometimes get assigned to a neighboring cluster, revealing the fuzzy frontiers between design families. The noise flags (typically in the upper-left, since they sort first) show near-zero co-assignment with everything, confirming their status as genuine outliers.

12 Hypothesis Engine

We have discovered that national flags organize into coherent visual clusters. The natural next question is: why? Are these clusters random, or do they correlate with real-world properties of the nations behind them, such as geography, history, wealth, or political status?

To answer this, we enrich our dataset with country-level metadata from the REST Countries API: geographic region, subregion, continent, latitude, longitude, population, area, landlocked status, independence, UN membership, Gini coefficient (income inequality), number of official languages, number of land borders, and driving side. We then run a battery of statistical tests to discover which of these variables are associated with flag design.

Fetch country metadata from REST Countries API
import requests

# ---- Two API calls (max 10 fields each) ----
fields_geo = "name,cca2,region,subregion,latlng,population,area,landlocked,independent"
fields_cul = "name,cca2,languages,gini,continents,borders,unMember,car"

batch_geo = {c["cca2"].lower(): c for c in
             requests.get(f"https://restcountries.com/v3.1/all?fields={fields_geo}").json()}
batch_cul = {c["cca2"].lower(): c for c in
             requests.get(f"https://restcountries.com/v3.1/all?fields={fields_cul}").json()}

# ---- Build metadata table ----
meta_rows = []
for code in codes:
    g = batch_geo.get(code, {})
    c = batch_cul.get(code, {})
    ll = g.get("latlng", [])
    lat, lng = (ll[0], ll[1]) if len(ll) == 2 else (None, None)
    gini_dict = c.get("gini", {})
    langs = c.get("languages", {})

    meta_rows.append({
        "code":         code,
        "region":       g.get("region"),
        "subregion":    g.get("subregion"),
        "continent":    c.get("continents", [None])[0] if c.get("continents") else None,
        "latitude":     lat,
        "longitude":    lng,
        "abs_latitude":  abs(lat) if lat is not None else None,
        "population":   g.get("population"),
        "area_km2":     g.get("area"),
        "landlocked":   g.get("landlocked"),
        "independent":  g.get("independent"),
        "un_member":    c.get("unMember"),
        "n_languages":  len(langs) if langs else None,
        "n_borders":    len(c.get("borders", [])),
        "gini":         max(gini_dict.values()) if gini_dict else None,
        "drive_side":   c.get("car", {}).get("side"),
    })

df_meta = pd.DataFrame(meta_rows)

# ---- Merge with flag features + cluster labels ----
df_full = df.merge(df_meta, on="code")
df_clust = df_full[df_full["cluster"] >= 0].copy()

print(f"Metadata loaded for {len(df_meta)} entities")
print(f"Flags in clusters: {len(df_clust)}, Noise points: {(df_full['cluster'] < 0).sum()}")
print(f"Gini coverage: {df_meta['gini'].notna().sum()} / {len(df_meta)}")
Metadata loaded for 250 entities
Flags in clusters: 215, Noise points: 35
Gini coverage: 167 / 250

12.1 Do clusters reflect geography?

The most fundamental question: do flags that look alike come from the same part of the world? We test this with a chi-squared test of association between cluster membership and geographic region (or continent, or subregion). The effect size is measured by Cramér’s V, which ranges from 0 (no association) to 1 (perfect association).
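As a quick illustration of that scale (toy tables, not the flag data): a contingency table where cluster membership perfectly determines region gives V = 1, and a perfectly uniform table gives V = 0.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy contingency tables (illustrative only, not the flag data).
def cramers_v(ct):
    # correction=False: Yates' continuity correction would otherwise
    # kick in for 2x2 tables and keep V slightly below 1.
    chi2, _, _, _ = chi2_contingency(ct, correction=False)
    return np.sqrt(chi2 / (ct.sum() * (min(ct.shape) - 1)))

perfect = np.array([[30, 0], [0, 30]])    # cluster fully determines region
uniform = np.array([[15, 15], [15, 15]])  # no association at all

print(cramers_v(perfect))   # 1.0
print(cramers_v(uniform))   # 0.0
```

The cluster-by-region tables in the real analysis are larger than 2×2, so the continuity correction does not apply there.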

Chi-squared tests: cluster vs. geographic and political variables
from scipy.stats import chi2_contingency

# ---- Chi-squared tests for categorical variables ----
cat_vars = ["region", "continent", "subregion", "landlocked",
            "independent", "un_member", "drive_side"]

chi_results = []
for var in cat_vars:
    ct = pd.crosstab(df_clust["cluster"], df_clust[var])
    if ct.shape[0] > 1 and ct.shape[1] > 1:
        chi2, p, dof, _ = chi2_contingency(ct)
        cramers_v = np.sqrt(chi2 / (ct.sum().sum() * (min(ct.shape) - 1)))
        chi_results.append({
            "variable": var,
            "chi2": round(chi2, 1),
            "p_value": p,
            "cramers_v": round(cramers_v, 3),
            "significant": "Yes" if p < 0.01 else "No",
        })

df_chi = pd.DataFrame(chi_results).sort_values("cramers_v", ascending=False)
itshow(df_chi, pageLength=10)

The results are striking. Subregion shows the strongest association (Cramér’s V ~ 0.48), followed by continent and region, all highly significant (p < 0.001). This means flag clusters are not randomly distributed across the globe: certain visual designs concentrate in specific parts of the world. Pan-African color schemes group African nations together, the Nordic cross unites Scandinavian countries, the blue-ensign template connects the former British Empire, and Pan-Arab tricolors group Middle Eastern and North African states.

Independence and UN membership also show significant associations (p < 0.01), reflecting the fact that dependent territories (colonies, overseas departments) tend to inherit their sovereign’s flag design: the blue-ensign cluster is almost entirely made up of British dependencies.

Landlocked status and driving side show no significant association with flag design, which makes sense: these are practical facts about a country that have no obvious reason to influence its symbols.

12.2 Do clusters reflect latitude?

The “Solar Determinism” hypothesis: do countries closer to the equator use warmer colors (red, yellow) while countries at higher latitudes prefer cooler, simpler designs?

Kruskal-Wallis tests: cluster vs. continuous variables
from scipy.stats import kruskal

# ---- Kruskal-Wallis for continuous variables ----
cont_vars = ["abs_latitude", "latitude", "longitude", "population",
             "area_km2", "gini", "n_languages", "n_borders"]

kw_results = []
for var in cont_vars:
    valid = df_clust.dropna(subset=[var])
    groups = [g[var].values for _, g in valid.groupby("cluster")]
    groups = [g for g in groups if len(g) > 0]
    if len(groups) > 1:
        H, p = kruskal(*groups)
        N, k = len(valid), len(groups)
        eta_sq = (H - k + 1) / (N - k)
        kw_results.append({
            "variable": var,
            "H_statistic": round(H, 1),
            "p_value": p,
            "eta_squared": round(eta_sq, 3),
            "significant": "Yes" if p < 0.01 else "No",
        })

df_kw = pd.DataFrame(kw_results).sort_values("eta_squared", ascending=False)
itshow(df_kw, pageLength=10)

Absolute latitude is significantly associated with cluster membership (p < 0.001, η² ~ 0.06, i.e. latitude accounts for roughly 6% of the rank variance). This supports the Solar Determinism hypothesis: tropical nations genuinely tend to land in different flag clusters than temperate or polar ones. Population, area, and number of borders are also significant: larger, more connected countries end up in different design families than small island territories. The Gini coefficient (income inequality) does not reach significance at the cluster level, but as we will see, it shows compelling feature-level correlations.

12.3 Feature-level correlations

Beyond cluster membership, which individual flag features correlate with which country properties? This gives us more granular insight.

Spearman correlations: flag features vs. absolute latitude
from scipy.stats import spearmanr

# ---- Features vs |latitude| ----
v = df_full.dropna(subset=["abs_latitude"])
solar_rows = []
for f in feature_cols:
    rho, p = spearmanr(v["abs_latitude"], v[f])
    solar_rows.append({"feature": f, "rho": round(rho, 3), "p_value": p, "abs_rho": abs(rho)})

df_solar = (pd.DataFrame(solar_rows)
            .sort_values("abs_rho", ascending=False)
            .drop(columns="abs_rho")
            .reset_index(drop=True))
itshow(df_solar, pageLength=19)

The Solar Determinism hypothesis finds real support in the data. As latitude increases (moving away from the equator):

  • Yellow and green percentages decrease (ρ ~ -0.29 and -0.29): tropical nations use significantly more warm greens and yellows, the colors of vegetation, earth, and sunlight.
  • White percentage increases (ρ ~ +0.24): higher-latitude nations use more white, the color of snow and winter.
  • Vertical dominance increases (ρ ~ +0.22): northern nations favor vertical stripes (European tricolors), while tropical nations show more diagonal elements.
  • Palette complexity and visual entropy decrease with latitude: temperate and polar flags tend to be simpler.
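A note on method: Spearman's ρ is computed on ranks, so any strictly monotone transform of a variable (such as the log10 applied to population and GDP later in this section) leaves the correlation unchanged. A quick check on synthetic data:

```python
import numpy as np
from scipy.stats import spearmanr

# Spearman rho is rank-based: a strictly monotone transform of either variable
# does not change it. Synthetic stand-ins, not the flag data.
rng = np.random.default_rng(42)
x = rng.uniform(1, 90, 200)                # stand-in for |latitude|
y = 50 - 0.4 * x + rng.normal(0, 8, 200)   # noisy decreasing trend

rho_raw, _ = spearmanr(x, y)
rho_log, _ = spearmanr(np.log10(x), y)     # same ranks, same rho
print(round(rho_raw, 3), round(rho_log, 3))
```

This is why the log transforms in later cells are cosmetic for the correlation tables: they matter for plot readability, not for the reported ρ values.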
Yellow percentage vs. absolute latitude
fig = px.scatter(
    df_full.dropna(subset=["abs_latitude"]),
    x="abs_latitude",
    y="yellow_pct",
    hover_name="name",
    trendline="ols",
    title="Solar Determinism: Yellow Fades with Latitude",
    labels={"abs_latitude": "Absolute Latitude (°)", "yellow_pct": "Yellow Percentage"},
    width=750,
    height=500,
)
fig.update_traces(marker=dict(size=6, opacity=0.6))
fig.show()
Spearman correlations: flag features vs. Gini coefficient
# ---- Features vs Gini (inequality) ----
v3 = df_full.dropna(subset=["gini"])
gini_rows = []
for f in feature_cols:
    rho, p = spearmanr(v3["gini"], v3[f])
    gini_rows.append({"feature": f, "rho": round(rho, 3), "p_value": p, "abs_rho": abs(rho)})

df_gini = (pd.DataFrame(gini_rows)
           .sort_values("abs_rho", ascending=False)
           .drop(columns="abs_rho")
           .reset_index(drop=True))
itshow(df_gini, pageLength=19)

The Gini correlation table reveals something unexpected and fascinating. Higher income inequality is associated with:

  • More palette complexity (ρ ~ +0.32), the strongest single correlation: countries with greater inequality tend to have flags with more distinct colors.
  • More yellow (ρ ~ +0.28) and more green (ρ ~ +0.22), the warm colors of the Global South.
  • Higher visual entropy (ρ ~ +0.27) and edge density (ρ ~ +0.21): more complex, busier flag designs.
  • Less red (ρ ~ -0.19) and a lower aggression index (ρ ~ -0.17).

This is likely a confound with geography (high-Gini countries tend to be tropical, post-colonial nations), but it raises a provocative question: do nations with more complex social stratification produce more complex national symbols? Or is it simply that the Pan-African and Pan-American design traditions, which emphasize multi-color richness, happen to belong to regions with higher inequality?
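The geographic confound can be probed directly with a partial rank correlation: rank-transform all three variables, regress latitude out of the rank vectors, and correlate the residuals. A minimal sketch on synthetic data (the real version would use the `gini`, `palette_complexity`, and `abs_latitude` columns of `df_full`):

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

def partial_spearman(x, y, z):
    """Spearman rho between x and y after removing the rank-linear effect of z."""
    rx, ry, rz = rankdata(x), rankdata(y), rankdata(z)
    A = np.column_stack([np.ones_like(rz), rz])
    resid = lambda a: a - A @ np.linalg.lstsq(A, a, rcond=None)[0]
    return pearsonr(resid(rx), resid(ry))

# Synthetic stand-in where a latitude-like variable drives both quantities:
rng = np.random.default_rng(1)
lat = rng.uniform(0, 60, 300)
gini = 45 - 0.30 * lat + rng.normal(0, 4, 300)
complexity = 6 - 0.05 * lat + rng.normal(0, 1, 300)

rho_raw, _ = spearmanr(gini, complexity)            # inflated by shared driver
rho_partial, _ = partial_spearman(gini, complexity, lat)
print(f"raw rho = {rho_raw:.2f}, partial rho = {rho_partial:.2f}")
```

In this construction the raw correlation is entirely an artifact of the shared latitude driver, and the partial correlation collapses toward zero; on the real data, a partial ρ that stays away from zero would be evidence that the Gini-complexity link survives the confound.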

Palette complexity vs. Gini coefficient
fig = px.scatter(
    df_full.dropna(subset=["gini"]),
    x="gini",
    y="palette_complexity",
    hover_name="name",
    color="region",
    trendline="ols",
    title="Inequality and Flag Complexity: More Gini, More Colors",
    labels={"gini": "Gini Coefficient", "palette_complexity": "Palette Complexity (distinct colors)"},
    width=750,
    height=500,
)
fig.update_traces(marker=dict(size=7, opacity=0.7))
fig.show()

Coloring by region in the scatter above helps disentangle the confound. The trend is not entirely driven by geography: within every region, there is a positive slope. African nations vary enormously in both Gini and palette complexity, and the correlation holds within Africa alone. European flags are clustered at the low end of both axes, but even there, more unequal European countries (like Portugal, with its complex coat of arms) have slightly more complex flags.
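The within-region claim can be checked mechanically with a grouped correlation. A sketch on a toy frame shaped like the columns used above (the real run would group `df_full` by `region`):

```python
import pandas as pd
from scipy.stats import spearmanr

# Toy frame standing in for (region, gini, palette_complexity) in df_full.
df = pd.DataFrame({
    "region": ["Africa"] * 6 + ["Europe"] * 6,
    "gini":               [32, 35, 40, 45, 50, 55, 25, 27, 29, 31, 33, 36],
    "palette_complexity": [3, 4, 4, 5, 6, 7, 2, 3, 3, 4, 4, 5],
})

rhos = {}
for region, g in df.groupby("region"):
    rho, _ = spearmanr(g["gini"], g["palette_complexity"])
    rhos[region] = round(rho, 2)
print(rhos)  # positive slope inside each region (toy data, by construction)
```

If the per-region ρ values are all positive while the regions sit at different baselines, the pooled correlation is not purely a between-region artifact.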

Spearman correlations: flag features vs. log(population)
# ---- Features vs log(population) ----
v2 = df_full[df_full["population"] > 0].copy()
v2["log_pop"] = np.log10(v2["population"])

pop_rows = []
for f in feature_cols:
    rho, p = spearmanr(v2["log_pop"], v2[f])
    pop_rows.append({"feature": f, "rho": round(rho, 3), "p_value": p, "abs_rho": abs(rho)})

df_pop = (pd.DataFrame(pop_rows)
          .sort_values("abs_rho", ascending=False)
          .drop(columns="abs_rho")
          .reset_index(drop=True))
itshow(df_pop, pageLength=19)

Population reveals a complementary pattern. Larger nations have flags that are:

  • Less blue (ρ ~ -0.31): small island territories and former colonies (which use blue ensigns) dominate the small-population end.
  • Simpler (lower palette complexity, ρ ~ -0.21; lower edge density, ρ ~ -0.28): large nations can afford iconic, instantly recognizable designs.
  • More symmetric (ρ ~ +0.26): a simple, symmetric flag works better at scale, on everything from passport covers to UN flagpoles.
  • More aggressive (higher red, ρ ~ +0.20; higher aggression index, ρ ~ +0.23): large nations lean toward the red end of the spectrum.
  • Less diagonal (ρ ~ -0.21): large nations prefer the stability of horizontal and vertical stripes.
Blue percentage vs. log(population)
fig = px.scatter(
    v2,
    x="log_pop",
    y="blue_pct",
    hover_name="name",
    color="region",
    title="Big Nations Avoid Blue: Population vs. Blue Percentage",
    labels={"log_pop": "log₁₀(Population)", "blue_pct": "Blue Percentage"},
    width=750,
    height=500,
)
fig.update_traces(marker=dict(size=7, opacity=0.7))
fig.show()

The blue-population scatter confirms this vividly. The upper-left quadrant (small population, high blue) is packed with Oceanian and Caribbean territories (Anguilla, Montserrat, Tuvalu, Guam), all flying blue-field flags inherited from colonial powers. The lower-right (large population, low blue) holds the major nations: China, India, Indonesia, Brazil, Nigeria. This is partly a colonial confound (small territories = dependencies = inherited blue ensigns), but it suggests a genuine rule of vexillography: big countries need bold, distinctive flags; small territories can afford to blend in.

12.4 Cluster composition by region

Cluster vs. Region heatmap
# ---- Crosstab: cluster x region (proportions within cluster) ----
ct = pd.crosstab(df_clust["cluster"], df_clust["region"], normalize="index")

fig = px.imshow(
    ct.values,
    x=ct.columns.tolist(),
    y=[f"Cluster {c}" for c in ct.index],
    color_continuous_scale="YlOrRd",
    labels=dict(color="Proportion"),
    title="Regional Composition of Each Flag Cluster",
    aspect="auto",
    width=750,
    height=500,
)
fig.update_layout(margin=dict(l=100, r=20, t=50, b=50))
fig.show()

The cluster-by-region heatmap is the clearest summary of our findings. Some clusters are overwhelmingly dominated by a single region: these are the design traditions inherited from colonial or cultural history. Others are genuinely cross-continental, grouping visually similar flags from unrelated parts of the world. The latter are the most interesting: they suggest universal principles of design convergence, where unrelated nations independently arrived at similar visual solutions.

12.5 Region composition by cluster

The heatmap above shows what each cluster is made of. The reverse question is equally revealing: where do each region’s flags end up? Do all African flags land in the same cluster, or are they scattered across several design families?

Region → Cluster distribution heatmap
# ---- Crosstab: region x cluster (proportions within region) ----
ct_rev = pd.crosstab(df_clust["region"], df_clust["cluster"], normalize="index")

fig = px.imshow(
    ct_rev.values,
    x=[f"Cluster {c}" for c in ct_rev.columns],
    y=ct_rev.index.tolist(),
    color_continuous_scale="YlOrRd",
    labels=dict(color="Proportion"),
    title="Where Do Each Region's Flags End Up?",
    aspect="auto",
    width=750,
    height=400,
)
fig.update_layout(margin=dict(l=100, r=20, t=50, b=50))
fig.show()

The reverse heatmap tells a complementary story. Some regions scatter their flags across many clusters: Africa spans at least five different design families, reflecting the diversity of Pan-African, Pan-Arab, and post-colonial design traditions coexisting on the same continent. Oceania, by contrast, concentrates into one or two clusters, reflecting the overwhelming influence of the British blue-ensign template. Europe splits cleanly between horizontal and vertical tricolors, with a few outliers in the red-dominant clusters. The Americas distribute more evenly, with Caribbean territories landing in different clusters than Central and South American nations.

12.6 Beyond geography: climate, wealth, and strange correlations

We have shown that flag clusters correlate with geography. But geography is a proxy for many things: climate, colonial history, economic development, cultural traditions. To disentangle these factors, we enriched our dataset with external data from two APIs. From the Open-Meteo Archive API we obtained 2023 daily climate records for each country’s coordinates: average annual temperature, total precipitation, and total sunshine hours. From the World Bank API we fetched GDP per capita, life expectancy, and forest cover percentage.
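The exact fetch code is not reproduced here, but a minimal sketch of how one Open-Meteo Archive request might be assembled follows. The coordinates and daily-variable names (`temperature_2m_mean`, `precipitation_sum`, `sunshine_duration`) are illustrative assumptions based on the public API docs and should be verified against the current documentation:

```python
from urllib.parse import urlencode

# Sketch only: assemble one Open-Meteo Archive request for a single country's
# coordinates. Coordinates and daily-variable names are illustrative; verify
# them against the current API docs before relying on this.
params = {
    "latitude": 9.08,
    "longitude": 8.68,
    "start_date": "2023-01-01",
    "end_date": "2023-12-31",
    "daily": "temperature_2m_mean,precipitation_sum,sunshine_duration",
    "timezone": "UTC",
}
url = "https://archive-api.open-meteo.com/v1/archive?" + urlencode(params)
print(url)
# A real fetch would then do something like:
#   requests.get(url, timeout=30).json()["daily"]
```

Aggregating the returned daily series (mean temperature, summed precipitation and sunshine) yields the three per-country climate variables used below.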

Load external data: climate (Open-Meteo) + development (World Bank)
# ---- Load the external data we fetched via API ----
df_extra = pd.read_csv("data/extra_metadata.csv")

# ---- Merge into our working DataFrame ----
df_full = df_full.merge(df_extra, on="code", how="left")
df_clust = df_full[df_full["cluster"] >= 0].copy()

print("Extra variable coverage:")
for col in ["avg_temp_c", "annual_precip_mm", "annual_sunshine_hrs",
            "gdp_per_capita", "life_expectancy", "forest_pct"]:
    n = df_full[col].notna().sum()
    print(f"  {col:25s}: {n}/{len(df_full)}")
Extra variable coverage:
  avg_temp_c               : 250/250
  annual_precip_mm         : 250/250
  annual_sunshine_hrs      : 250/250
  gdp_per_capita           : 208/250
  life_expectancy          : 216/250
  forest_pct               : 213/250

With 12 external variables and 19 flag features, there are 228 possible correlations, and most of them are noise. Rather than cherry-picking the pairs that happen to look interesting, we computed all 228 Spearman correlations, kept only those with |ρ| > 0.20 and p < 0.01, and discarded the rest. The full matrix is shown in the next subsection; the sections after it focus on the handful of relationships that are genuinely strong.
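The filtering step is mechanical: stack the ρ and p matrices into long format and keep the rows that pass both thresholds. A sketch with toy matrices shaped like the `corr_df` / `pval_df` built in the next subsection:

```python
import pandas as pd

# Toy 2x3 matrices standing in for the full 12x19 corr_df / pval_df.
corr_df = pd.DataFrame(
    [[-0.31, 0.05, 0.24], [0.32, -0.02, 0.27]],
    index=["abs_latitude", "gini"],
    columns=["yellow_pct", "blue_pct", "palette_complexity"],
)
pval_df = pd.DataFrame(
    [[0.001, 0.44, 0.002], [0.0005, 0.80, 0.001]],
    index=corr_df.index, columns=corr_df.columns,
)

# Long format: one row per (external variable, flag feature) pair.
long = corr_df.stack().rename("rho").to_frame().join(pval_df.stack().rename("p"))
kept = long[(long["rho"].abs() > 0.20) & (long["p"] < 0.01)]
print(kept.sort_values("rho", key=abs, ascending=False))
```

Note that this double threshold is a crude guard against multiple testing, not a formal correction; a Benjamini-Hochberg adjustment of the 228 p-values would be the more rigorous alternative.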

12.7 The full correlation map

Let us start with the big picture. The heatmap below shows every Spearman correlation between external variables and flag features. Cells colored deep red or deep blue represent real, strong associations; cells near white represent statistical noise.

Full Spearman correlation heatmap: external variables vs. flag features
from scipy.stats import spearmanr

# ---- Compute all pairwise Spearman correlations ----
external_vars = ["abs_latitude", "avg_temp_c", "annual_precip_mm",
                 "annual_sunshine_hrs", "gdp_per_capita", "life_expectancy",
                 "forest_pct", "gini", "population", "area_km2",
                 "n_languages", "n_borders"]

corr_matrix = []
pval_matrix = []
for ext in external_vars:
    row_rho, row_p = [], []
    v = df_full.dropna(subset=[ext]).copy()
    if ext in ["gdp_per_capita", "population", "area_km2"]:
        # log-transform heavy-tailed variables; Spearman is rank-based, so this
        # only changes results where clip(lower=1) collapses values into ties
        v[ext] = np.log10(v[ext].clip(lower=1))
    for f in feature_cols:
        rho, p = spearmanr(v[ext], v[f])
        row_rho.append(round(rho, 3))
        row_p.append(p)
    corr_matrix.append(row_rho)
    pval_matrix.append(row_p)

corr_df = pd.DataFrame(corr_matrix, index=external_vars, columns=feature_cols)
pval_df = pd.DataFrame(pval_matrix, index=external_vars, columns=feature_cols)

# ---- Mask non-significant correlations for annotation ----
annot = corr_df.copy().astype(str)
for i in range(len(external_vars)):
    for j in range(len(feature_cols)):
        rho = corr_matrix[i][j]
        p = pval_matrix[i][j]
        if abs(rho) >= 0.20 and p < 0.01:
            annot.iloc[i, j] = f"{rho:+.2f}"
        else:
            annot.iloc[i, j] = ""

fig = px.imshow(
    corr_df.values,
    x=feature_cols,
    y=external_vars,
    color_continuous_scale="RdBu_r",
    zmin=-0.45, zmax=0.45,
    labels=dict(color="Spearman ρ"),
    title="The Full Correlation Map: External Variables vs. Flag Features",
    aspect="auto",
    width=850,
    height=500,
)
fig.update_traces(text=annot.values, texttemplate="%{text}", textfont=dict(size=8))
fig.update_layout(
    xaxis=dict(tickangle=45, tickfont=dict(size=9)),
    yaxis=dict(tickfont=dict(size=10)),
    margin=dict(l=120, r=20, t=50, b=100),
)
fig.show()

Three coherent structures jump out of the heatmap:

  1. The latitude-temperature-wealth axis. Absolute latitude, temperature (inverted), GDP per capita, and life expectancy all correlate with the same flag features in the same direction. This is the signature of the Global North vs. Global South divide: high-latitude, wealthy, long-lived nations favor white, blue, simple, vertically striped flags; equatorial, poorer, shorter-lived nations favor yellow, green, complex, horizontally striped designs. These four variables are so correlated with each other (latitude ↔ GDP: ρ ~ 0.55; GDP ↔ life expectancy: ρ ~ 0.85) that they form a single underlying dimension.

  2. The colonial-size fingerprint. Population shows the clearest pattern of any single variable: small territories use significantly more blue (ρ = -0.33) and have higher edge density (ρ = -0.31), while large nations use more red (ρ = +0.19) and show more symmetry (ρ = +0.28). This reflects the blue-ensign inheritance of small British dependencies.

  3. The null results. Forest cover shows essentially zero correlation with green in the flag (ρ = -0.09, p = 0.20). Precipitation shows no correlation with palette complexity (ρ = 0.05, p = 0.44). Sunshine hours show no correlation with blue percentage (ρ = -0.01, p = 0.82). These are clean negative results, and they matter: flag colors are symbolic, not representational. A country does not put green on its flag because it has forests, or blue because it lacks sunshine.

12.8 The strongest signals: development and flag simplicity

The single strongest correlations in our dataset involve life expectancy and GDP per capita versus the color green. Countries where people live longer use dramatically less green in their flags (ρ = -0.40, p < 0.001). This is the strongest individual correlation in the entire analysis.

Strip plot: green percentage by development tier
# ---- Create development tiers for visualization ----
v_dev = df_full.dropna(subset=["life_expectancy"]).copy()
v_dev["dev_tier"] = pd.qcut(v_dev["life_expectancy"], q=4,
    labels=["Bottom 25%\n(< 65 yr)", "25-50%\n(65-73 yr)",
            "50-75%\n(73-78 yr)", "Top 25%\n(> 78 yr)"])

fig = px.strip(
    v_dev,
    x="dev_tier",
    y="green_pct",
    hover_name="name",
    color="region",
    title="The Longer They Live, the Less Green They Fly (ρ = −0.40)",
    labels={"dev_tier": "Life Expectancy Quartile", "green_pct": "Green %"},
    width=750,
    height=500,
)
fig.update_traces(marker=dict(size=7, opacity=0.7))
fig.update_layout(legend_title_text="Region")
fig.show()

The strip plot makes the pattern vivid. In the bottom quartile of life expectancy (mostly Sub-Saharan Africa), green percentages spread from 0 to 65%, with many flags using green as a primary color. In the top quartile (mostly Europe and East Asia), green is nearly absent. The mechanism is cultural, not biological: the green-heavy flag traditions (Pan-African, Pan-Arab, Islamic) belong to regions that happen to have lower life expectancy due to historical underdevelopment, not because of any property of the color green itself.

The same pattern holds for GDP per capita versus green (ρ = -0.39), and a mirror-image pattern holds for white: wealthier nations use significantly more white (ρ = +0.28).

Bubble chart: GDP, life expectancy, green, and palette complexity
# ---- Bubble chart: multi-dimensional view ----
v_bub = df_full.dropna(subset=["gdp_per_capita", "life_expectancy"]).copy()
v_bub["log_gdp"] = np.log10(v_bub["gdp_per_capita"])

fig = px.scatter(
    v_bub,
    x="log_gdp",
    y="life_expectancy",
    size="palette_complexity",
    color="green_pct",
    hover_name="name",
    color_continuous_scale="Greens",
    size_max=18,
    title="Development, Complexity, and the Color Green",
    labels={"log_gdp": "log₁₀(GDP per capita, USD)",
            "life_expectancy": "Life Expectancy (years)",
            "palette_complexity": "Palette Complexity",
            "green_pct": "Green %"},
    width=800,
    height=550,
)
fig.update_layout(margin=dict(t=50, b=50))
fig.show()

This bubble chart encodes four dimensions at once. Each flag is a circle whose position reflects its country’s wealth (x) and longevity (y), whose size reflects how many colors the flag uses, and whose shade of green reflects how much green is in the flag. The result is striking: the lower-left corner (poor, short-lived) is full of large, dark-green bubbles; the upper-right (rich, long-lived) holds small, pale bubbles. Development simplifies flags and drains their green.

12.9 Solar Determinism: latitude shapes the palette

The “Solar Determinism” hypothesis posits that proximity to the equator influences flag colors: specifically, that tropical nations favor warm tones (yellow, green) and that polar nations favor cool, minimal designs. The data support this, though only moderately.

Violin plots: key color features by latitude band
# ---- Create latitude bands ----
v_lat = df_full.dropna(subset=["abs_latitude"]).copy()
v_lat["lat_band"] = pd.cut(v_lat["abs_latitude"],
    bins=[0, 15, 30, 45, 90],
    labels=["Tropical\n(0-15°)", "Subtropical\n(15-30°)",
            "Temperate\n(30-45°)", "High latitude\n(45°+)"])

# ---- Melt to long format for faceted violins ----
solar_features = ["yellow_pct", "green_pct", "white_pct", "vertical_dominance"]
v_long = v_lat.melt(
    id_vars=["code", "name", "lat_band"],
    value_vars=solar_features,
    var_name="feature",
    value_name="value",
)

import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=4, subplot_titles=[
    "Yellow %<br>(ρ = −0.30)", "Green %<br>(ρ = −0.29)",
    "White %<br>(ρ = +0.24)", "Vertical Dominance<br>(ρ = +0.22)"])

colors = ["#e6ab02", "#1b9e77", "#cccccc", "#7570b3"]
for i, feat in enumerate(solar_features):
    for j, band in enumerate(["Tropical\n(0-15°)", "Subtropical\n(15-30°)",
                               "Temperate\n(30-45°)", "High latitude\n(45°+)"]):
        vals = v_lat[v_lat["lat_band"] == band][feat].dropna()
        fig.add_trace(go.Violin(
            y=vals, name=band, line_color=colors[i],
            box_visible=True, meanline_visible=True,
            showlegend=False,
        ), row=1, col=i+1)

fig.update_layout(
    title="Solar Determinism: How Latitude Shapes the Flag Palette",
    height=450, width=900,
    margin=dict(t=80, b=40),
)
fig.show()

The violin plots tell a more nuanced story than a simple scatter with a trendline. Yellow and green show a clear downward gradient from tropical to high-latitude bands, with the widest distributions in the tropics (some tropical flags use 50%+ green; others use none). White increases with latitude, though the effect is modest. Vertical dominance increases sharply in the temperate and high-latitude bands, reflecting the European tricolor tradition. These are real effects (all p < 0.001), but they are moderate in size (|ρ| ~ 0.22-0.30); latitude explains maybe 5-9% of the variance in any single color feature.
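The 5-9% figure comes from squaring ρ, a rough variance-explained heuristic (exact for Pearson r; for Spearman it describes shared rank variance rather than raw variance):

```python
# Rough variance-explained heuristic behind the "5-9%" figure: square rho.
# (Exact for Pearson r; for Spearman it describes shared *rank* variance.)
for rho in (0.22, 0.30):
    print(f"rho = {rho:.2f}  ->  rho^2 = {rho ** 2:.1%}")
# 0.22 -> 4.8%, 0.30 -> 9.0%
```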

12.10 Inequality and flag complexity

The Gini coefficient (income inequality) produces one of the more surprising results.

Spearman correlations: flag features vs. Gini coefficient
# ---- Features vs Gini (inequality) ----
v3 = df_full.dropna(subset=["gini"])
gini_rows = []
for f in feature_cols:
    rho, p = spearmanr(v3["gini"], v3[f])
    gini_rows.append({"feature": f, "rho": round(rho, 3), "p_value": p, "abs_rho": abs(rho)})

df_gini = (pd.DataFrame(gini_rows)
           .sort_values("abs_rho", ascending=False)
           .drop(columns="abs_rho")
           .reset_index(drop=True))
itshow(df_gini, pageLength=19)

Higher income inequality is associated with more palette complexity (ρ = +0.32, the strongest single Gini correlation), more yellow (ρ = +0.28), and more visual entropy (ρ = +0.27). In other words, more unequal societies fly more complex flags.

Palette complexity by Gini tercile
# ---- Gini terciles ----
v_gini = df_full.dropna(subset=["gini"]).copy()
v_gini["gini_tier"] = pd.qcut(v_gini["gini"], q=3,
    labels=["Low inequality\n(Gini < 33)", "Medium\n(33-40)", "High inequality\n(Gini > 40)"])

fig = px.violin(
    v_gini,
    x="gini_tier",
    y="palette_complexity",
    color="gini_tier",
    hover_name="name",
    box=True,
    points="all",
    title="More Inequality, More Colors on the Flag (ρ = +0.32)",
    labels={"gini_tier": "Income Inequality Tier", "palette_complexity": "Palette Complexity (distinct colors)"},
    width=700,
    height=500,
)
fig.update_layout(showlegend=False)
fig.show()

The violin plot reveals that the shift is gradual but real. Low-inequality nations (mostly European) cluster around 3-4 colors; high-inequality nations (mostly African, Latin American) spread from 3 to 8 colors. The most likely explanation is a geographic confound: high-Gini countries tend to be post-colonial, tropical nations whose flag traditions (Pan-African, Pan-American) emphasize multi-color symbolism. But the effect holds within regions too, which suggests that the confound does not explain everything. Whether complex societies produce complex symbols, or whether this is pure coincidence mediated by colonial history, is a question we cannot answer with 250 data points.
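One way to push a 250-point dataset a little further is a region-stratified permutation test: shuffle Gini only within each region, so regional baselines survive but any within-region link is destroyed, then ask how often the shuffled correlation matches the observed one. A sketch on synthetic data (the real run would stratify `df_full` by `region`):

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-in: two "regions" with different baselines plus a genuine
# within-region gini -> complexity link.
rng = np.random.default_rng(7)
region = np.repeat([0, 1], 60)
gini = np.where(region == 0, 45.0, 30.0) + rng.normal(0, 5, 120)
complexity = 2 + 0.08 * gini + rng.normal(0, 0.8, 120)

obs, _ = spearmanr(gini, complexity)

# Permute gini *within* regions: regional baselines are preserved, so the
# null distribution keeps the between-region part of the correlation.
perm_rhos = []
for _ in range(1000):
    g = gini.copy()
    for r in (0, 1):
        idx = np.where(region == r)[0]
        g[idx] = rng.permutation(g[idx])
    rho, _ = spearmanr(g, complexity)
    perm_rhos.append(rho)

p_strat = (np.sum(np.array(perm_rhos) >= obs) + 1) / (len(perm_rhos) + 1)
print(f"observed rho = {obs:.2f}, stratified permutation p = {p_strat:.3f}")
```

A small stratified p-value means the observed correlation exceeds what regional baselines alone can produce, i.e. the within-region effect is real and not purely the geographic confound.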

12.11 Population and the colonial blue

Population reveals the clearest non-geographic pattern.

Spearman correlations: flag features vs. log(population)
# ---- Features vs log(population) ----
v2 = df_full[df_full["population"] > 0].copy()
v2["log_pop"] = np.log10(v2["population"])

pop_rows = []
for f in feature_cols:
    rho, p = spearmanr(v2["log_pop"], v2[f])
    pop_rows.append({"feature": f, "rho": round(rho, 3), "p_value": p, "abs_rho": abs(rho)})

df_pop = (pd.DataFrame(pop_rows)
          .sort_values("abs_rho", ascending=False)
          .drop(columns="abs_rho")
          .reset_index(drop=True))
itshow(df_pop, pageLength=19)

Larger nations have less blue (ρ = -0.33), less edge density (ρ = -0.31), more symmetry (ρ = +0.28), and more red (ρ = +0.19). The blue-population link is the second strongest individual correlation in the entire analysis, and it has a simple explanation: small territories are disproportionately British dependencies that inherited complex blue-ensign flags, while large, independent nations designed their own, simpler, bolder symbols.

Strip plot: blue percentage by population quartile
# ---- Population quartiles ----
v2["pop_tier"] = pd.qcut(v2["log_pop"], q=4,
    labels=["Smallest 25%\n(< 30K)", "25-50%\n(30K-1M)",
            "50-75%\n(1M-15M)", "Largest 25%\n(> 15M)"])

fig = px.strip(
    v2,
    x="pop_tier",
    y="blue_pct",
    hover_name="name",
    color="region",
    title="Small Territories Fly Blue: Population vs. Blue % (ρ = −0.33)",
    labels={"pop_tier": "Population Quartile", "blue_pct": "Blue %"},
    width=750,
    height=500,
)
fig.update_traces(marker=dict(size=7, opacity=0.7))
fig.update_layout(legend_title_text="Region")
fig.show()

The smallest population quartile is packed with Oceanian and Caribbean blue-ensign territories (Anguilla, Montserrat, Tuvalu, Cook Islands). The largest quartile holds the major nations: China, India, Indonesia, Brazil, Nigeria, none of which have significant blue in their flags. This is one of the clearest examples in the dataset of colonial history leaving a measurable trace in flag design.

12.12 The honest null results

Not everything correlates with everything. Several hypotheses that sound plausible turn out to be completely unsupported by the data:

Confirmed null correlations
# ---- Test and display the nulls honestly ----
null_tests = [
    ("forest_pct", "green_pct", "Forest cover vs. green in flag"),
    ("annual_precip_mm", "palette_complexity", "Precipitation vs. palette complexity"),
    ("annual_sunshine_hrs", "blue_pct", "Sunshine hours vs. blue in flag"),
    ("life_expectancy", "aggression_index", "Life expectancy vs. aggression index"),
    ("avg_temp_c", "aggression_index", "Temperature vs. aggression index"),
    ("avg_temp_c", "yellow_pct", "Temperature vs. yellow in flag"),
]

null_rows = []
for ext, feat, label in null_tests:
    v = df_full.dropna(subset=[ext, feat])  # drop rows missing either variable
    rho, p = spearmanr(v[ext], v[feat])
    null_rows.append({
        "Hypothesis": label,
        "ρ": round(rho, 3),
        "p-value": round(p, 4),
        "Verdict": "No correlation" if abs(rho) < 0.15 or p >= 0.01
                   else "Weak" if abs(rho) < 0.20
                   else "Moderate",
    })

df_null = pd.DataFrame(null_rows)
itshow(df_null, pageLength=10)

These nulls are worth spelling out:

  • Forest cover vs. green in the flag: ρ = -0.09, p = 0.20. Brazil uses lots of green and has lots of forest; Saudi Arabia uses lots of green and has almost none. Flag green is about Islam, Pan-Africanism, and national ideology, not ecology.
  • Precipitation vs. flag complexity: ρ = 0.05, p = 0.44. Rainy countries do not have more colorful flags. Period.
  • Sunshine vs. blue: ρ = -0.01, p = 0.82. Completely flat. The “sunny countries avoid blue” story is a myth.
  • Temperature vs. yellow: ρ = +0.18, p = 0.004. Statistically significant but weak. The latitude version of this hypothesis (ρ = -0.30 for |latitude| vs yellow) is stronger, suggesting that the “solar” effect operates through geography and colonial history rather than through temperature directly.
  • Life expectancy and temperature vs. aggression: Both null (|ρ| ~ 0.11, p > 0.05). Hot countries do not have more aggressive flags. Long-lived countries do not have calmer flags. The aggression index does not correlate meaningfully with any external variable except population (ρ = +0.23).

12.13 Do clusters differ on climate and wealth?

Kruskal-Wallis tests: cluster membership vs. external variables
# ---- Test whether clusters differ significantly on each external variable ----
ext_kw_vars = ["avg_temp_c", "annual_precip_mm", "annual_sunshine_hrs",
               "gdp_per_capita", "life_expectancy", "forest_pct"]

kw_ext = []
for var in ext_kw_vars:
    valid = df_clust.dropna(subset=[var])
    groups = [g[var].values for _, g in valid.groupby("cluster")]
    groups = [g for g in groups if len(g) > 0]
    if len(groups) > 1:
        H, p = kruskal(*groups)
        N, k = len(valid), len(groups)
        eta_sq = (H - k + 1) / (N - k)
        kw_ext.append({
            "variable": var,
            "H_statistic": round(H, 1),
            "p_value": p,
            "eta_squared": round(eta_sq, 3),
            "significant": "Yes" if p < 0.01 else "No",
        })

df_kw_ext = pd.DataFrame(kw_ext).sort_values("eta_squared", ascending=False)
itshow(df_kw_ext, pageLength=10)
Average temperature by cluster
fig = px.box(
    df_clust.dropna(subset=["avg_temp_c"]),
    x="cluster",
    y="avg_temp_c",
    color="cluster",
    title="Temperature Distribution by Flag Cluster",
    labels={"cluster": "Cluster", "avg_temp_c": "Average Temperature (°C)"},
    width=850,
    height=500,
    category_orders={"cluster": sorted(df_clust["cluster"].unique())},
)
fig.update_layout(showlegend=False)
fig.update_xaxes(type="category")
fig.show()
GDP per capita by cluster
fig = px.box(
    df_clust.dropna(subset=["gdp_per_capita"]),
    x="cluster",
    y="gdp_per_capita",
    color="cluster",
    title="GDP per Capita by Flag Cluster",
    labels={"cluster": "Cluster", "gdp_per_capita": "GDP per Capita (USD)"},
    width=850,
    height=500,
    category_orders={"cluster": sorted(df_clust["cluster"].unique())},
)
fig.update_layout(showlegend=False)
fig.update_xaxes(type="category")
fig.show()

The cluster-level boxplots add texture to the feature-level correlations. Temperature varies dramatically across clusters: some clusters sit entirely in the tropics (medians above 25°C), while others are firmly temperate (medians below 15°C). GDP tells a similar story, with the widest spread in the blue-ensign cluster (which includes both wealthy Australia and tiny, impoverished dependencies). These differences confirm that flag design families are not random visual accidents; they are correlated with real-world geography and economics, mediated by the colonial and cultural forces that shaped both a nation’s development and its flag.

12.14 Synthesis: what actually drives flag design?

Summary of all effect sizes: which external variable matters most?
# ---- Collect all Spearman correlations into a single summary ----
all_corrs = []
# "*" marks variables that are log10-transformed before correlating (see below).
summary_vars = {
    "abs_latitude": "Absolute Latitude",
    "avg_temp_c": "Avg. Temperature",
    "annual_precip_mm": "Annual Precipitation",
    "annual_sunshine_hrs": "Sunshine Hours",
    "gini": "Gini (Inequality)",
    "gdp_per_capita": "GDP per Capita*",
    "life_expectancy": "Life Expectancy",
    "forest_pct": "Forest Cover",
    "population": "Population*",
    "area_km2": "Area*",
}

for var, label in summary_vars.items():
    v = df_full.dropna(subset=[var]).copy()
    if var in ["gdp_per_capita", "population", "area_km2"]:
        v[var] = np.log10(v[var].clip(lower=1))
    rhos = []
    sig_count = 0
    for f in feature_cols:
        rho, p = spearmanr(v[var], v[f])
        rhos.append(abs(rho))
        if p < 0.01:
            sig_count += 1
    all_corrs.append({
        "Variable": label,
        "Mean |ρ|": round(np.mean(rhos), 3),
        "Max |ρ|": round(np.max(rhos), 3),
        "# significant (p<0.01)": sig_count,
    })

df_effects = pd.DataFrame(all_corrs).sort_values("Mean |ρ|", ascending=False)
itshow(df_effects, pageLength=10)
Bar chart: mean effect size by external variable
fig = px.bar(
    df_effects.sort_values("Mean |ρ|"),
    x="Mean |ρ|",
    y="Variable",
    orientation="h",
    color="Max |ρ|",
    color_continuous_scale="Viridis",
    title="What Drives Flag Design? Effect Size by External Variable",
    labels={"Mean |ρ|": "Mean |Spearman ρ| Across 19 Features", "Variable": ""},
    width=750,
    height=450,
)
fig.update_layout(margin=dict(l=140, r=20, t=50, b=50))
fig.show()

The bar chart provides the final answer. Life expectancy and GDP per capita have the broadest influence across all 19 flag features, with mean |ρ| values above 0.15 and maximum individual correlations around 0.40. These are followed by population (driven by the colonial blue-ensign effect), absolute latitude (the solar determinism axis), and Gini inequality (flag complexity). At the bottom, forest cover, precipitation, and area show weak or null overall effects.

The story these numbers tell is clear. Flag design is not random, but it is not directly driven by physical environment either. The strongest predictors are all human variables: how wealthy a country is, how long its people live, how many people it contains, and how unequal their society is. Geography matters too, but primarily as a proxy for colonial history and cultural tradition. A flag is a cultural artifact shaped by the same forces that shape nations themselves: latitude determines climate, climate influenced colonial expansion, colonial expansion determined political boundaries, and political boundaries determined which flag traditions each new nation inherited or invented. The 19 dimensions of a flag encode, in miniature, the entire trajectory of the nation behind it.

Conclusion

We began with 250 rectangles of color and ended with a quantitative portrait of how nations represent themselves. The journey moved through three stages, each building on the last.

Stage 1: Feature extraction. We converted every flag into 19 numerical features across five families: color palette (8), color complexity (3), visual complexity (3), geometric structure (4), and aspect ratio (1). These features capture what a human can see at a glance, from the fraction of red pixels to the symmetry of the layout, and compress it into a form that algorithms can compare.
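Two of those features can be sketched in a few lines. The snippet below is a minimal illustration, not the notebook's exact extraction code: the `red_fraction` threshold rule and the `margin` value are assumptions, and the toy flag is synthetic.

```python
import numpy as np

def red_fraction(img: np.ndarray, margin: int = 60) -> float:
    """Fraction of pixels that are predominantly red.

    img: H x W x 3 uint8 RGB array. A pixel counts as red when its R
    channel exceeds both G and B by at least `margin` (a hypothetical
    threshold, not the notebook's exact rule).
    """
    r = img[..., 0].astype(int)
    g = img[..., 1].astype(int)
    b = img[..., 2].astype(int)
    mask = (r - g >= margin) & (r - b >= margin)
    return float(mask.mean())

def aspect_ratio(img: np.ndarray) -> float:
    """Width / height: the single feature of the aspect-ratio family."""
    h, w = img.shape[:2]
    return w / h

# A toy 2:3 "flag": top half pure red, bottom half pure white.
flag = np.zeros((100, 150, 3), dtype=np.uint8)
flag[:50] = (255, 0, 0)       # red band
flag[50:] = (255, 255, 255)   # white band

print(red_fraction(flag))  # 0.5
print(aspect_ratio(flag))  # 1.5
```

Each of the 19 features reduces, like these two, to a single scalar per flag, so every flag becomes one row of a 250 × 19 matrix.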

Stage 2: Structure discovery. In this notebook, we computed pairwise distances between all 250 flags using both hand-crafted features and deep learning embeddings from ResNet-50. UMAP projected the resulting space into two dimensions, revealing a landscape where Pan-African tricolors cluster away from Nordic crosses, and the blue ensigns of former British territories form their own archipelago. HDBSCAN identified 12 stable clusters, with roughly 70% of flags assigned to a group and the remainder classified as noise, flags too unique to fit neatly into any family.
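The distance computation that feeds UMAP and HDBSCAN follows a standard pattern; here is a minimal sketch on random stand-in data. The z-score scaling and Euclidean metric are assumptions about the notebook's choices, and the feature matrix is synthetic.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
# Stand-in for the real data: 250 flags x 19 hand-crafted features.
X = rng.random((250, 19))

# Z-score each feature so no single scale dominates the distance.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

# Condensed pairwise Euclidean distances, expanded to a 250 x 250 matrix.
D = squareform(pdist(Xz, metric="euclidean"))

print(D.shape)               # (250, 250)
print(np.allclose(D, D.T))   # True: a distance matrix is symmetric
```

A matrix like `D` (or its deep-embedding counterpart) is what UMAP projects into 2D and what HDBSCAN clusters, with noise points labeled −1.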

Stage 3: Hypothesis testing. We tested whether the patterns in flag space reflect patterns in the real world. The answer is nuanced. The strongest signals are human: GDP per capita and life expectancy correlate with flag simplicity (ρ ≈ 0.30–0.40), and population size predicts the presence of blue through colonial inheritance. Latitude shapes the palette, with equatorial nations using more warm tones and higher latitudes favoring cooler designs, but the effect is moderate (ρ ≈ −0.30). Meanwhile, several intuitive hypotheses turned out to be null: forest cover does not predict green, precipitation does not predict complexity, and sunshine hours have essentially zero correlation with blue.
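Each of those correlations comes from the same test: a Spearman rank correlation between one external variable and one flag feature. A self-contained sketch on synthetic data (the variable names and the built-in relationship are illustrative, not the real measurements):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n = 250  # one observation per flag

# Synthetic stand-ins with a built-in negative monotone relationship:
# flags nearer the equator (low |latitude|) get warmer palettes, plus noise.
abs_latitude = rng.uniform(0, 65, n)
warm_fraction = 1 - abs_latitude / 65 + rng.normal(0, 0.3, n)

rho, p = spearmanr(abs_latitude, warm_fraction)
print(f"rho = {rho:.2f}, significant at p < 0.01: {p < 0.01}")
```

Repeating this for every (external variable, flag feature) pair, then averaging |ρ| per variable, yields the effect-size table and bar chart shown above.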

12.15 What we learned

Three findings stand out:

  1. Colonial history is the dominant structuring force. Unsupervised clustering recovers the footprint of the British, French, and Spanish empires with no geographic input. The “Colonial Ghost” hypothesis is confirmed: flags inherit design traditions from their colonizers, and these traditions persist long after independence.

  2. Wealth simplifies, inequality complicates. Wealthier nations tend toward simpler, cooler flag designs, while more unequal societies use more colors and higher contrast. This mirrors a broader aesthetic pattern: the same minimalist impulse that drives modern corporate branding appears in the flags of developed nations.

  3. Physical environment is a weak predictor. Despite the romantic appeal of “solar determinism,” the direct effect of climate on flag design is modest. Latitude matters, but primarily because it correlates with colonial history and development level. The causal chain runs through human institutions, not sunlight.

12.16 Limitations

This analysis has clear boundaries. The 19 features, though diverse, are not exhaustive: they do not capture symbols (crescents, stars, coats of arms), text, or the semantic meaning of specific color choices. The deep learning embeddings partially compensate for this, but a dedicated symbol-detection pipeline would add a valuable dimension. The external metadata (GDP, Gini, life expectancy) are cross-sectional snapshots; flags, however, were designed at specific historical moments, and matching flag design to contemporary statistics introduces temporal mismatch. Finally, the sample of 250 flags, while comprehensive, is still small by machine learning standards, limiting the power of any individual statistical test.

12.17 Future directions

Several extensions suggest themselves. A longitudinal study could track how flags change when regimes change, testing whether revolutions produce measurable shifts in design features. Symbol detection via object recognition models could add a sixth feature family. And a generative model, trained on the feature distributions of each cluster, could answer the ultimate question: what would the flag of a country look like if all we knew was its latitude, GDP, and colonial history?

For now, the 19 dimensions are enough to show that flags are not arbitrary. They are data, compressed by history and encoded in cloth.