Data Processing

Complete documentation of how 5 ETO datasets containing 31 source files were cleaned, combined, harmonized, and transformed into 8 analysis-ready master datasets feeding the AI Supremacy Index scoring pipeline.

31 Source Files (across 5 datasets)
8 Master Outputs (analysis-ready CSVs)
~50K Total Rows (processed)
195 Countries (after harmonization)
7 Dimensions (scored per country)
End-to-End Pipeline
5 Raw Datasets → 31 CSV Files → Clean & Combine → Harmonize Names → 8 Master Files → 7 Dimensions → AISI Score
1. Advanced Semiconductor Supply Chain

5 source files → 1 master dataset (1,305 rows × 16 columns)

5 Source Files
126 Unique Inputs (tools, materials, processes)
374 Organizations (across 21 countries)
1,305 Master Rows (provider–input pairs)
96 Inputs w/ Share (breadth denominator)

Source Files

File | Rows | Cols | Role
inputs.csv | 126 | 10 | Catalog of chip production inputs (tools 90, materials 17, processes 11, designs 7)
providers.csv | 397 | 5 | Countries + organizations (includes aliases → 374 unique after dedup)
provision.csv | 1,305 | 7 | Core linkage: which providers supply which inputs, with market share %
sequence.csv | 139 | 6 | Supply chain relationships: 53 "goes into" + 86 "is type of"
stages.csv | 3 | 6 | Three production stages: Design, Fabrication, ATP

Processing Pipeline

Step 1 — Deduplicate Providers: providers.csv contains alias rows. Drop duplicates on provider_id (397 → 374 unique). Rename country → provider_hq_country.
Step 2 — Join Provision + Providers: Left join on provider_id, adding provider_type and provider_hq_country.
Step 3 — Join Input Metadata: Left join on provided_id = input_id, adding input_type, stage_id, input_data_year, input_market_size.
Step 4 — Join Stage Metadata: Left join on stage_id, adding production_stage (Design / Fabrication / ATP).
Step 5 — Derive Effective Country: For country-type providers, use provider_name (ISO code). For organizations, use provider_hq_country. This creates a unified country attribution column.
Step 6 — Final Cleanup & Reorder: Drop redundant columns, rename for clarity (provided_name → input_name, year → provision_year). Reorder to 16 analysis-ready columns.
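The six steps above can be sketched in pandas. Everything below is illustrative: the tiny DataFrames stand in for the real CSVs, with column names taken from the source-file table in this section.

```python
import pandas as pd

# Illustrative data standing in for the real CSVs (schema from the tables above).
providers = pd.DataFrame({
    "provider_id": [1, 1, 2],  # rows 0 and 1 simulate an alias duplicate
    "provider_name": ["TSM", "TSMC", "TWN"],
    "provider_type": ["organization", "organization", "country"],
    "country": ["TWN", "TWN", None],
})
provision = pd.DataFrame({
    "provider_id": [1, 2], "provided_id": [10, 10], "share_provided": [55.0, 45.0],
})
inputs_df = pd.DataFrame({"input_id": [10], "input_type": ["tools"], "stage_id": [2]})
stages = pd.DataFrame({"stage_id": [2], "production_stage": ["Fabrication"]})

# Step 1 — deduplicate providers and rename the HQ-country column.
providers = (providers.drop_duplicates("provider_id")
                      .rename(columns={"country": "provider_hq_country"}))

# Steps 2–4 — chained left joins onto the provision linkage table.
master = (provision
          .merge(providers, on="provider_id", how="left")
          .merge(inputs_df, left_on="provided_id", right_on="input_id", how="left")
          .merge(stages, on="stage_id", how="left"))

# Step 5 — unified country attribution: country-type providers carry an ISO
# code in provider_name; organizations fall back to their HQ country.
master["effective_country"] = master["provider_name"].where(
    master["provider_type"] == "country", master["provider_hq_country"])
```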

Quality Checks

Market share sums: Country-level shares per input average ~100% (median 100.0%); no outliers fall outside the 90–110% band.
"Various" entries: Aggregated small providers (<1% share each) — kept as-is, excluded from country-level analysis.
Missing shares: 30 inputs have no market share data → only 96 inputs used for breadth denominator.
HQ Attribution: All country assignments are HQ-based (e.g., TSMC → Taiwan even with US fabs). Flagged in methodology.
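The market-share check reduces to a groupby-sum over the master table; a minimal sketch with illustrative rows (column names follow the master schema above):

```python
import pandas as pd

# Illustrative provision rows: country shares for each input should sum to ~100%.
master = pd.DataFrame({
    "input_id": [10, 10, 10, 11, 11],
    "effective_country": ["TWN", "KOR", "USA", "NLD", "JPN"],
    "share_provided": [55.0, 30.0, 15.0, 80.0, 19.5],
})

# Sum country-level shares per input, then flag inputs outside the 90–110% band.
share_sums = (master.groupby("input_id")["share_provided"]
                    .sum()
                    .rename("total_share"))
outliers = share_sums[(share_sums < 90) | (share_sums > 110)]
```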

Output Schema

semiconductor_master.csv — 1,305 rows × 16 columns
├── provider_name, provider_id, provider_type, provider_hq_country, effective_country
├── input_name, input_id, input_name_full, input_type
├── production_stage, stage_id, input_data_year
├── share_provided, provision_year
└── input_market_size, source
2. Cross-Border Tech Research Collaborations

8 field-specific CSVs → 1 master dataset (21,118 rows × 8 columns)

8 Source Files (one per research field)
21,118 Master Rows (country-pair × field × year)
94 Countries
2015–2025 Year Range (complete through 2023)
≥25 Min Threshold (articles per pair/year)

Source Files (by field)

File | Rows | Field | Note
Artificial_intelligence.csv | 10,569 | AI (general) | Largest — broadest coverage
Computer_vision.csv | 3,335 | Computer Vision | Mature subfield
Chip_design_and_fabrication.csv | 2,418 | Chip Design | Strategically weighted 2.0×
Cybersecurity.csv | 1,709 | Cybersecurity | National security relevance
Robotics.csv | 1,561 | Robotics | Mature subfield
Natural_language_processing.csv | 921 | NLP | Broad applied AI
Large_language_models.csv | 432 | LLMs | Frontier AI, weighted 1.8×
AI_safety.csv | 181 | AI Safety | Smallest — emerging field, weighted 1.6×

Processing Pipeline

Step 1 — Concatenate: Stack all 8 CSVs vertically. Identical schema: country1, country2, field, year, num_articles, complete. Total: 21,118 rows.
Step 2 — Standardize Pair Ordering: Create country_a, country_b with alphabetical ordering so (US, China) and (China, US) map to the same canonical pair. Prevents double-counting in analysis.
Step 3 — Completeness Filtering: Flag rows where complete=True (data considered reliable). Years 2024–2025 are incomplete and excluded from scoring. Complete range: 2015–2023.
Step 4 — Validation: Verify all country names are consistent (uses full names like "United States", "China"). No duplicate pairs per field/year. Minimum 25-article threshold already applied by source.
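Step 2's canonical pair ordering can be sketched as follows (the rows are illustrative; the column names match the schema listed in Step 1):

```python
import pandas as pd

# Illustrative rows: the same US–China pair appears in both orders.
df = pd.DataFrame({
    "country1": ["United States", "China", "China"],
    "country2": ["China", "United States", "Japan"],
    "field": ["AI (general)"] * 3,
    "year": [2023, 2023, 2023],
    "num_articles": [100, 100, 40],
})

# Step 2 — alphabetical ordering maps (US, China) and (China, US) to one
# canonical pair, preventing double-counting downstream.
swap = df["country1"] > df["country2"]
df["country_a"] = df["country1"].where(~swap, df["country2"])
df["country_b"] = df["country2"].where(~swap, df["country1"])

# Canonical pairs can now be deduplicated per field/year.
pairs = df[["country_a", "country_b", "field", "year"]].drop_duplicates()
```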

Key Data Characteristics

US–China axis dominates: 197K co-authored articles — 2.4× the next largest pair.
Growth trajectory: 39K articles in 2015 → 174K in 2023 (+342% overall).
Partner diversity: Mean 19.1 partners, median 12, 95th percentile = 54 (used as cap in AISI).
3. Country AI Activity Metrics

9 CSVs → 4 master datasets (unified + 3 pillar-level)

9 Source Files (3 publication + 2 patent + 4 investment)
189 Countries (after excluding 12 groups)
204 Raw Entities (including OECD, NATO, EU…)
12 Groups Excluded (ASEAN, Five Eyes, Quad…)

Source Files

File | Rows | Pillar | Key Metrics
publications_yearly_articles.csv | 8,983 | Publications | num_articles by country × field × year
publications_yearly_citations.csv | ~8K | Publications | num_citations (no complete flag — lagged)
publications_yearly_highly_cited.csv | ~8K | Publications | Highly cited article counts
patents_yearly_applications.csv | 5,832 | Patents | num_patent_applications (complete through 2020)
patents_yearly_grants.csv | ~5K | Patents | Granted patents (removed from scoring — grant rate bias)
companies_yearly_disclosed.csv | ~12K | Investment | Disclosed investment ($M)
companies_yearly_estimated.csv | 15,339 | Investment | Estimated investment ($M) — used for scoring
companies_yearly_num_transactions.csv | ~12K | Investment | Transaction counts
companies_yearly_num_companies.csv | ~12K | Investment | Active company counts

Processing Pipeline

Step 1 — Exclude Group Entities: Remove 12 aggregate entries: ASEAN, Africa, Asia, EU, Europe, Five Eyes, Global Partnership on AI, NATO, North America and the Caribbean, OECD, Oceania, Quad. These are not countries and distort country-level metrics. 204 → 192 entities.
Step 2 — Harmonize "China (mainland)": Country AI Activity uses "China (mainland)" while all other datasets use "China"; mapped to the canonical "China" via a lookup table. An is_group boolean is also set before the group filtering in Step 1.
Step 3 — Filter on Completeness: Only complete=True rows used for scoring. Publications: 2015–2023. Patents: 2015–2020 (5-year lag!). Investment: 2015–2024. Citations: no complete flag → use lagged window 2021–2023.
Step 4 — Build Pillar-Level Masters: Aggregate by country across complete years. Publications master: avg annual articles, fields covered. Patents master: total applications, field diversity (out of 11 fields). Investment master: total estimated investment ($M).
Step 5 — Build Unified Master: Join all 3 pillars on country name. Set boolean availability flags (has_publications, has_citations, has_patents, has_investment) BEFORE any NaN filling. Output: 189 countries × 10 columns.
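A minimal sketch of Step 5's flag-before-fill rule, with illustrative pillar tables (the country rows and metric column names are stand-ins, not the real pillar schemas):

```python
import pandas as pd

# Illustrative pillar masters keyed on canonical country names.
pubs = pd.DataFrame({"country": ["China", "Kenya"],
                     "avg_annual_articles": [150000.0, 400.0]})
patents = pd.DataFrame({"country": ["China"], "total_applications": [90000.0]})
invest = pd.DataFrame({"country": ["China"],
                       "total_estimated_investment_musd": [50000.0]})

# Step 5 — outer-join the pillars, then set availability flags BEFORE any NaN
# filling, so "no data" stays distinguishable from a genuine zero.
unified = (pubs.merge(patents, on="country", how="outer")
               .merge(invest, on="country", how="outer"))
unified["has_publications"] = unified["avg_annual_articles"].notna()
unified["has_patents"] = unified["total_applications"].notna()
unified["has_investment"] = unified["total_estimated_investment_musd"].notna()
```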

Critical Data Findings

Extreme skewness: Publications skew=8.3, Patents skew=6.9, Investment skew=10.4. Log(1+x) reduces all to <1.0.
Patent lag: Complete patent data ends at 2020 — 5 years behind. Innovation Output measures pre-pandemic activity only.
Coverage varies: 192 countries have publications, but only 73 have patents and 117 have investment data. 65 countries have all 3 pillars.
Citation gap: No complete flag for citations. Using lagged window (2021–2023) to avoid measuring data pipeline latency.

Output Files

country_ai_unified_master.csv     — 189 rows × 10 cols  (one row per country, all pillars)
country_ai_publications_master.csv — 8,983 rows          (yearly articles by country × field)
country_ai_patents_master.csv      — 5,832 rows          (yearly applications by country × field)
country_ai_investment_master.csv   — 15,339 rows         (yearly estimated investment by country × field)
4. Private-Sector AI Indicators (PARAT)

5 source files → parat_master.csv + parat_country_agg.csv

691 Companies (startups to multinationals)
17 Countries (85% US-based)
63 Core Columns (publications, patents, workforce)
71 Master Columns (after enrichment)

Source Files

File | Rows | Role
core.csv | 691 | Main metrics: AI pubs, patents, workforce, company metadata (HQ, stage, sector)
yearly_publication_counts.csv | ~6K | Disaggregated yearly publication & patent data per company
alias.csv | varies | Alternate company names for matching
ticker.csv | varies | Stock exchange symbols for public companies
id.csv | varies | Cross-references: LinkedIn, Crunchbase, ROR, PermID

Processing Pipeline

Step 1 — Load & Merge: Join core.csv with yearly_publication_counts.csv on company ID. Add alias and ticker for enrichment. Result: 691 companies × 71 columns.
Step 2 — Flag Key Attributes: Identify company categories — S&P 500, Global Big Tech, GenAI Contenders. Flag companies with zero AI publications (35%) and zero AI patents (38%).
Step 3 — Country-Level Aggregation: Group by HQ country. Compute: company_count, total_ai_pubs, total_ai_patents, total_ai_workers, total_tech_workers, avg_ai_pubs_per_company, top_conference_pubs, sp500_count, big_tech_count, genai_count.
Step 4 — Talent Tier Assignment: For each country, count companies with non-zero Workforce: AI workers. Tier 1: ≥3 companies with workforce data. Tier 2: has AI publications. Tier 3: company count only.
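The tier assignment in Step 4 can be sketched as follows (the company rows are illustrative, and `ai_workers`/`ai_pubs` are simplified stand-ins for the real PARAT column names such as "Workforce: AI workers"):

```python
import pandas as pd

# Illustrative PARAT company rows grouped by HQ country.
parat = pd.DataFrame({
    "hq_country": ["United States"] * 3 + ["France"] * 2 + ["Chile"],
    "ai_workers": [1200.0, 300.0, 50.0, None, None, None],
    "ai_pubs": [500, 20, 0, 15, 0, 0],
})

def talent_tier(group: pd.DataFrame) -> int:
    """Tier 1: >=3 companies with workforce data; Tier 2: any AI pubs; Tier 3: count only."""
    if group["ai_workers"].notna().sum() >= 3:
        return 1
    if group["ai_pubs"].sum() > 0:
        return 2
    return 3

tiers = {country: talent_tier(g) for country, g in parat.groupby("hq_country")}
```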

Known Limitations

US dominance: 85% of companies are US-based. Workforce data (LinkedIn-derived) is essentially US-only — LinkedIn blocked in China/Russia.
Curated, not exhaustive: PARAT is a selected set of 691 companies, not a census. Biased toward well-known firms.
Maturity skew: 93% of companies are at "mature" stage. Startup representation is limited.
5. AGORA — AI Governance Archive

4 source files → agora_master.csv with NLP stance classification

973 Documents (laws, regulations, standards)
77 Taxonomy Tags (5 domains)
91% US-Focused (882 of 973 documents)
2 Scoreable Countries: US (390 enacted) + China (18)

Source Files

File | Rows | Role
documents.csv | 973 | Core metadata: authority, status, dates, summaries, 77 tag columns
segments.csv | 8,116 | Sub-document segments with granular annotations
authorities.csv | 105 | Issuing bodies with jurisdiction and parent authority
collections.csv | 10 | Thematic groupings

Processing Pipeline

Step 1 — Merge Documents + Authorities: Left join documents on Authority = Name from authorities table. Adds Jurisdiction field for country mapping.
Step 2 — Map to Countries: Sub-national authorities (California, NYC) → "United States". Exclude "Multinational" and "Other countries" (cannot be mapped to individual nations). Only US and China are individually scoreable.
Step 3 — NLP Stance Classification: Classify each document as Enabling, Restrictive, Neutral, or Balanced using a dual-signal approach:
Tag-based scoring (2× weight): Map all 77 taxonomy tags to stance indicators. Sum pro-enabling vs pro-restrictive tag counts.
Keyword analysis (1× weight): Scan document summaries for stance-indicating keywords (e.g., "promote", "fund" → enabling; "restrict", "prohibit" → restrictive).
Result: 346 Enabling, 293 Restrictive, 191 Neutral, 137 Balanced, 6 Unclassified.
Step 4 — Boolean Tag Normalization: Convert all 77 tag columns to proper booleans (handling mixed TRUE/"True"/1 formats). Required for thematic breadth calculation.
Step 5 — Governance Metrics: Per country: enacted_docs (log-transformed), thematic_breadth (unique tags covered / 77), maturity_ratio (enacted / total docs).
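A minimal sketch of the dual-signal classifier in Step 3. The tag names, keyword lists, and tie-breaking rule shown here are illustrative assumptions; the real pipeline maps all 77 taxonomy tags and a fuller keyword lexicon.

```python
# Illustrative stance lexicons (the real mapping covers all 77 taxonomy tags).
ENABLING_TAGS = {"R&D funding", "Workforce development"}
RESTRICTIVE_TAGS = {"Prohibitions", "Enforcement"}
ENABLING_KEYWORDS = ("promote", "fund", "invest")
RESTRICTIVE_KEYWORDS = ("restrict", "prohibit", "ban")

def classify_stance(tags: set, summary: str) -> str:
    # Tag-based signal weighted 2x; keyword signal from the summary weighted 1x.
    score = 2 * (len(tags & ENABLING_TAGS) - len(tags & RESTRICTIVE_TAGS))
    text = summary.lower()
    score += sum(kw in text for kw in ENABLING_KEYWORDS)
    score -= sum(kw in text for kw in RESTRICTIVE_KEYWORDS)
    if score > 0:
        return "Enabling"
    if score < 0:
        return "Restrictive"
    # Tie: "Balanced" means offsetting signals, "Neutral" means no signal at all.
    has_signal = bool(tags & (ENABLING_TAGS | RESTRICTIVE_TAGS))
    return "Balanced" if has_signal else "Neutral"
```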

Document Status Breakdown

450 Enacted — active law/regulation
400 Defunct — expired or superseded
123 Proposed — pending legislation
6. Country Name Harmonization

Unifying naming conventions across all 5 datasets into canonical forms

Each dataset uses different country naming conventions — ISO 3166 codes (semiconductor), full names with variants ("China (mainland)" in Country AI Activity), and standard full names (cross-border, PARAT, AGORA). Without explicit harmonization, cross-dataset joins fail silently and countries lose data across dimensions.

Key Mappings

Canonical Name | Semiconductor | Cross-Border | Country AI | PARAT | AGORA
China | CHN | China | China (mainland) | China | China
United States | USA | United States | United States | United States | United States
South Korea | KOR | South Korea | South Korea | South Korea | –
Taiwan | TWN | Taiwan | Taiwan | Taiwan | –
Japan | JPN | Japan | Japan | Japan | –
Germany | DEU | Germany | Germany | Germany | –
Netherlands | NLD | Netherlands | Netherlands | Netherlands | –
United Kingdom | GBR | United Kingdom | United Kingdom | United Kingdom | –

Harmonization Function

# Illustrative subsets of the full lookup tables.
GROUP_ENTITIES = {"ASEAN", "EU", "NATO", "OECD"}          # 12 entries in full
ISO_TO_NAME = {"CHN": "China", "USA": "United States"}    # semiconductor ISO codes
VARIANT_TO_CANONICAL = {"China (mainland)": "China"}      # naming variants

def harmonize_country(name):
    if name in GROUP_ENTITIES:       return None                         # Exclude 12 group entities
    if name in ISO_TO_NAME:          return ISO_TO_NAME[name]            # CHN → China, USA → United States
    if name in VARIANT_TO_CANONICAL: return VARIANT_TO_CANONICAL[name]   # "China (mainland)" → China
    if name == 'Various countries':  return None                         # Semiconductor aggregates
    return name                                                          # Already canonical

Group Entities Excluded (12)

ASEAN, Africa, Asia, EU, Europe, Five Eyes, GPAI, NATO, N. America & Caribbean, OECD, Oceania, Quad

These exist in the Country AI Activity dataset as aggregate rows. Removed before any metric computation.

7. AISI Scoring Pipeline

7 dimensions → coverage-weighted composite → final rankings

Dimension Weights (v3)

Dimension | Weight | Coverage (countries) | Sub-components | Data Source
Hardware Sovereignty | 25% | ~23 | Avg share (35%) + Peak share (30%) + Breadth/96 (35%) | Semiconductor
Research Capacity | 20% | ~192 | Log pub volume (35%) + Lagged citations (45%) + Z-scored growth (20%) | Country AI
Commercial Ecosystem | 20% | ~117 | Log investment (50%) + Log company count (30%) + Pub intensity (20%) | Country AI + PARAT
Innovation Output | 15% | ~73 | Log patent volume (45%) + Field diversity /11 (55%) | Country AI
Talent Base | 10% | 17 | Tiered fallback: Workforce → Publications → Company count | PARAT
Collaboration Network | 5% | ~94 | Partner diversity capped @54 (40%) + Strategic volume (35%) + Partner quality (25%) | Cross-Border
Governance Readiness | 5% | 2 | Log enacted docs (30%) + Breadth/77 (40%) + Maturity ratio (30%) | AGORA

Log Transform Impact

Metric | Raw Skew | Log Skew | Top/Median (Raw) | Top/Median (Log)
Publication volume | 8.3 | 0.2 | 1,127× | 2.6×
Patent volume | 6.9 | 0.8 | 4,987× | 3.4×
Estimated investment | 10.4 | 0.3 | 6,741× | 2.8×
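The mechanism behind these numbers can be demonstrated on synthetic data. The sketch below generates an illustrative heavy-tailed sample; the tabulated skew and ratio values come from the actual country-level data, not from this code.

```python
import numpy as np

# Illustrative heavy-tailed metric (e.g. publication volume across ~189 countries).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=4.0, sigma=2.5, size=189)
logged = np.log1p(raw)  # the log(1+x) transform used in scoring

def sample_skew(x: np.ndarray) -> float:
    """Standardized third moment of a sample."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())
```

Both the skewness and the top-to-median ratio collapse after the transform, which is what keeps a single outlier country from dominating a dimension score.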

Coverage-Weighted Composite Formula

1. Composite Base = Σ (dimension_score × dimension_weight × dimension_coverage)
                    ─────────────────────────────────────────────────────────
                               Σ (dimension_weight × dimension_coverage)

2. Final AISI Score = Composite Base × Breadth Multiplier

Where Breadth Multiplier = 0.75 + 0.25 × (dims_scored / 7)

Intuition: A country's score is first calculated as the weighted average of its available dimension data. This Composite Base is then adjusted by a Breadth Multiplier (ranging from 0.75 to 1.0) based on how many of the 7 dimensions have data. A country with data in only 1 of 7 dimensions loses ~21% of its base score, whereas a country with full 7/7 coverage keeps 100%.
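The two formulas can be written out directly. Weights come from the dimension table above; the coverage fractions and the per-country scores in the example are illustrative assumptions.

```python
# Dimension weights from the table above; coverage expressed as a fraction of
# the ~195 harmonized countries (illustrative, from the table's coverage column).
WEIGHTS = {"hardware": 0.25, "research": 0.20, "commercial": 0.20,
           "innovation": 0.15, "talent": 0.10, "collab": 0.05, "governance": 0.05}
COVERAGE = {"hardware": 23/195, "research": 192/195, "commercial": 117/195,
            "innovation": 73/195, "talent": 17/195, "collab": 94/195,
            "governance": 2/195}

def aisi_score(dim_scores: dict) -> float:
    """dim_scores maps dimension -> score (0-100); missing dimensions are omitted."""
    # 1. Coverage-weighted composite base over the available dimensions.
    num = sum(s * WEIGHTS[d] * COVERAGE[d] for d, s in dim_scores.items())
    den = sum(WEIGHTS[d] * COVERAGE[d] for d in dim_scores)
    base = num / den
    # 2. Breadth multiplier penalizes sparse dimension coverage (Bug #2 fix).
    breadth = 0.75 + 0.25 * (len(dim_scores) / 7)
    return base * breadth
```

With equal scores across all 7 dimensions the multiplier is 1.0 and the composite equals the dimension score; a single-dimension profile with the same score lands noticeably lower.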

Bug Fixes Applied

Bug #1 — NaN ≠ Zero: Countries absent from semiconductor/collaboration/governance datasets were treated as NaN (missing), not zero (genuine absence). The coverage-weighted formula rewarded them with higher scores for having fewer dimensions. Fix: Fill genuine-zero dimensions with 0 before scoring.
Bug #2 — Breadth Gaming: Countries scoring on only 2–3 dimensions could outscore broader profiles because the coverage denominator was smaller. Fix: Added a Breadth Multiplier 0.75 + 0.25 × (dims/7) that scales the raw score based on data depth.

Final Output Files

AISI_Final_Rankings.csv             — All ~195 countries, 7 dimensions + composite
AISI_Final_Rankings_v2_no_governance.csv — 6-dimension variant (governance excluded)
AISI_High_Confidence_Rankings.csv   — Filtered to ≥50% data coverage only