Data Processing
Complete documentation of how 5 ETO datasets containing 31 source files were cleaned, combined, harmonized, and transformed into 8 analysis-ready master datasets feeding the AI Supremacy Index scoring pipeline.
Advanced Semiconductor Supply Chain
5 source files → 1 master dataset (1,305 rows × 16 columns)
Source Files
| File | Rows | Cols | Role |
|---|---|---|---|
| inputs.csv | 126 | 10 | Catalog of chip production inputs (tools 90, materials 17, processes 11, designs 7) |
| providers.csv | 397 | 5 | Countries + organizations (includes aliases → 374 unique after dedup) |
| provision.csv | 1,305 | 7 | Core linkage: which providers supply which inputs, with market share % |
| sequence.csv | 139 | 6 | Supply chain relationships: 53 "goes into" + 86 "is type of" |
| stages.csv | 3 | 6 | Three production stages: Design, Fabrication, ATP |
Processing Pipeline
1. Deduplicate providers on provider_id: 397 → 374 unique. Rename country → provider_hq_country.
2. Join provision to providers on provider_id, adding provider_type and provider_hq_country.
3. Join to inputs on provided_id = input_id, adding input_type, stage_id, input_data_year, input_market_size.
4. Join to stages on stage_id, adding production_stage (Design / Fabrication / ATP).
5. Derive effective_country: for countries, use provider_name (ISO code). For organizations, use provider_hq_country. This creates a unified country attribution column.
6. Rename columns (provided_name → input_name, year → provision_year). Reorder to 16 analysis-ready columns.
Quality Checks
Output Schema
semiconductor_master.csv: 1,305 rows × 16 columns
├── provider_name, provider_id, provider_type, provider_hq_country, effective_country
├── input_name, input_id, input_name_full, input_type
├── production_stage, stage_id, input_data_year
├── share_provided, provision_year
└── input_market_size, source
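The country attribution rule from the processing pipeline (countries use their ISO code in provider_name, organizations use provider_hq_country) can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the provider_type values "country" and "organization" are assumed labels, and "ExampleCo" is a made-up provider.

```python
def effective_country(provider_type, provider_name, provider_hq_country):
    """Return the country a provision row is attributed to.

    Assumption: provider_type is "country" for country providers and
    anything else (e.g. "organization") for companies.
    """
    if provider_type == "country":
        # Country providers are named by their ISO code.
        return provider_name
    # Organizations are attributed to their headquarters country.
    return provider_hq_country


# Illustrative rows only (hypothetical provider, not real data).
rows = [
    {"provider_type": "country", "provider_name": "NLD", "provider_hq_country": None},
    {"provider_type": "organization", "provider_name": "ExampleCo", "provider_hq_country": "JPN"},
]
attributed = [
    effective_country(r["provider_type"], r["provider_name"], r["provider_hq_country"])
    for r in rows
]
```

With a single attribution column, market shares can be grouped by country whether the provider row is a country or a company.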
Cross-Border Tech Research Collaborations
8 field-specific CSVs → 1 master dataset (21,118 rows × 8 columns)
Source Files (by field)
| File | Rows | Field | Note |
|---|---|---|---|
| Artificial_intelligence.csv | 10,569 | AI (general) | Largest — broadest coverage |
| Computer_vision.csv | 3,335 | Computer Vision | Mature subfield |
| Chip_design_and_fabrication.csv | 2,418 | Chip Design | Strategically weighted 2.0× |
| Cybersecurity.csv | 1,709 | Cybersecurity | National security relevance |
| Robotics.csv | 1,561 | Robotics | Mature subfield |
| Natural_language_processing.csv | 921 | NLP | Broad applied AI |
| Large_language_models.csv | 432 | LLMs | Frontier AI, weighted 1.8× |
| AI_safety.csv | 181 | AI Safety | Smallest — emerging field, weighted 1.6× |
Processing Pipeline
1. Concatenate the 8 field files into one table with columns country1, country2, field, year, num_articles, complete. Total: 21,118 rows.
2. Create country_a, country_b with alphabetical ordering so (US, China) and (China, US) map to the same canonical pair. This prevents double-counting in analysis.
3. Filter to complete=True (data considered reliable). Years 2024–2025 are incomplete and excluded from scoring. Complete range: 2015–2023.
Key Data Characteristics
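The canonical-pair ordering and the complete=True filter described in the processing pipeline above can be sketched as follows. The rows are illustrative placeholders, and the tuple layout is an assumption that mirrors the column list.

```python
def canonical_pair(country1, country2):
    """Order a country pair alphabetically so direction does not matter."""
    country_a, country_b = sorted((country1, country2))
    return country_a, country_b


# Illustrative rows (not real data):
# (country1, country2, field, year, num_articles, complete)
rows = [
    ("United States", "China", "Robotics", 2022, 10, True),
    ("China", "United States", "Robotics", 2022, 10, True),
    ("Japan", "Germany", "Robotics", 2024, 3, False),
]

# Keep only complete rows and collect the distinct canonical keys.
keys = {
    (*canonical_pair(c1, c2), field, year)
    for c1, c2, field, year, _n, complete in rows
    if complete
}
```

Both orderings of the United States and China row collapse to one key, and the incomplete 2024 row is dropped.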
Country AI Activity Metrics
9 CSVs → 4 master datasets (unified + 3 pillar-level)
Source Files
| File | Rows | Pillar | Key Metrics |
|---|---|---|---|
| publications_yearly_articles.csv | 8,983 | Publications | num_articles by country × field × year |
| publications_yearly_citations.csv | ~8K | Publications | num_citations (no complete flag — lagged) |
| publications_yearly_highly_cited.csv | ~8K | Publications | Highly cited article counts |
| patents_yearly_applications.csv | 5,832 | Patents | num_patent_applications (complete thru 2020) |
| patents_yearly_grants.csv | ~5K | Patents | Granted patents (removed from scoring — grant rate bias) |
| companies_yearly_disclosed.csv | ~12K | Investment | Disclosed investment ($M) |
| companies_yearly_estimated.csv | 15,339 | Investment | Estimated investment ($M) — used for scoring |
| companies_yearly_num_transactions.csv | ~12K | Investment | Transaction counts |
| companies_yearly_num_companies.csv | ~12K | Investment | Active company counts |
Processing Pipeline
1. Filter: only complete=True rows are used for scoring. Publications: 2015–2023. Patents: 2015–2020 (5-year lag). Investment: 2015–2024. Citations: no complete flag, so a lagged window of 2021–2023 is used.
2. Build the unified country table: add coverage flags (has_publications, has_citations, has_patents, has_investment) BEFORE any NaN filling. Output: 189 countries × 10 columns.
Critical Data Findings
Citations have no complete flag. A lagged window (2021–2023) is used to avoid measuring data pipeline latency.
Output Files
country_ai_unified_master.csv: 189 rows × 10 cols (one row per country, all pillars)
country_ai_publications_master.csv: 8,983 rows (yearly articles by country × field)
country_ai_patents_master.csv: 5,832 rows (yearly applications by country × field)
country_ai_investment_master.csv: 15,339 rows (yearly estimated investment by country × field)
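The ordering constraint in the pipeline (coverage flags are set BEFORE any NaN filling) matters because a filled zero is indistinguishable from a true zero afterwards. A minimal sketch, assuming missing values are represented as None, the fill value is 0.0, and the metric keys are named after the pillars (all three are illustrative assumptions):

```python
PILLARS = ("publications", "citations", "patents", "investment")


def add_coverage_flags(row):
    """Record which pillars have data, then fill the gaps."""
    out = dict(row)
    # Flags first: they must reflect the raw data, not the filled values.
    for pillar in PILLARS:
        out[f"has_{pillar}"] = row.get(pillar) is not None
    # Only now replace missing values (fill value 0.0 is an assumption).
    for pillar in PILLARS:
        if out.get(pillar) is None:
            out[pillar] = 0.0
    return out


# Hypothetical country row: investment is a real zero, citations are missing.
example = add_coverage_flags(
    {"country": "Exampleland", "publications": 120.0,
     "citations": None, "patents": None, "investment": 0.0}
)
```

After filling, citations and investment both equal 0.0, but only has_investment is True, so the scoring step can still tell a missing pillar from a measured zero.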
Private-Sector AI Indicators (PARAT)
5 source files → parat_master.csv + parat_country_agg.csv
Source Files
| File | Rows | Role |
|---|---|---|
| core.csv | 691 | Main metrics: AI pubs, patents, workforce, company metadata (HQ, stage, sector) |
| yearly_publication_counts.csv | ~6K | Disaggregated yearly publication & patent data per company |
| alias.csv | varies | Alternate company names for matching |
| ticker.csv | varies | Stock exchange symbols for public companies |
| id.csv | varies | Cross-references: LinkedIn, Crunchbase, ROR, PermID |
Processing Pipeline
1. Merge core.csv with yearly_publication_counts.csv on company ID. Add alias and ticker for enrichment. Result: 691 companies × 71 columns.
2. Aggregate to country level: company_count, total_ai_pubs, total_ai_patents, total_ai_workers, total_tech_workers, avg_ai_pubs_per_company, top_conference_pubs, sp500_count, big_tech_count, genai_count.
3. Assign a talent tier per country. Workforce metric: AI workers. Tier 1: ≥3 companies with workforce data. Tier 2: has AI publications. Tier 3: company count only.
Known Limitations
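The tiered fallback from the processing pipeline above can be expressed as a small decision function. This is a sketch under assumptions: companies_with_workforce is a hypothetical per-country count of companies reporting workforce data, and returning None for countries with no PARAT companies is an illustrative choice.

```python
def talent_tier(companies_with_workforce, total_ai_pubs, company_count):
    """Pick the most reliable talent signal available for a country."""
    if companies_with_workforce >= 3:
        return 1  # Tier 1: enough companies report AI workforce data
    if total_ai_pubs > 0:
        return 2  # Tier 2: fall back to AI publications
    if company_count > 0:
        return 3  # Tier 3: only a company count is available
    return None   # No PARAT coverage at all (assumed handling)


tiers = [
    talent_tier(5, 900, 12),
    talent_tier(1, 40, 2),
    talent_tier(0, 0, 1),
    talent_tier(0, 0, 0),
]
```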
AGORA — AI Governance Archive
4 source files → agora_master.csv with NLP stance classification
Source Files
| File | Rows | Role |
|---|---|---|
| documents.csv | 973 | Core metadata: authority, status, dates, summaries, 77 tag columns |
| segments.csv | 8,116 | Sub-document segments with granular annotations |
| authorities.csv | 105 | Issuing bodies with jurisdiction and parent authority |
| collections.csv | 10 | Thematic groupings |
Processing Pipeline
1. Join documents on Authority = Name from the authorities table. This adds the Jurisdiction field for country mapping.
2. Compute per-jurisdiction metrics: enacted_docs (log-transformed), thematic_breadth (unique tags covered / 77), maturity_ratio (enacted / total docs).
Document Status Breakdown
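The three governance metrics named in the processing pipeline above can be sketched for one jurisdiction as follows. Several details are assumptions: the status label "Enacted", the use of log1p as the log transform, and the dict layout of a document.

```python
import math

TOTAL_TAGS = 77  # number of tag columns in documents.csv


def governance_metrics(docs):
    """docs: list of dicts with 'status' (str) and 'tags' (set of str)."""
    total = len(docs)
    enacted = sum(1 for d in docs if d["status"] == "Enacted")
    tags_covered = set()
    for d in docs:
        tags_covered |= d["tags"]
    return {
        "enacted_docs": math.log1p(enacted),  # log1p keeps zero at zero
        "thematic_breadth": len(tags_covered) / TOTAL_TAGS,
        "maturity_ratio": enacted / total if total else 0.0,
    }


# Illustrative documents with made-up tag names.
metrics = governance_metrics([
    {"status": "Enacted", "tags": {"tag_a", "tag_b"}},
    {"status": "Proposed", "tags": {"tag_b", "tag_c"}},
])
```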
Country Name Harmonization
Unifying naming conventions across all 5 datasets into canonical forms
Each dataset uses different country naming conventions — ISO 3166 codes (semiconductor), full names with variants ("China (mainland)" in Country AI Activity), and standard full names (cross-border, PARAT, AGORA). Without explicit harmonization, cross-dataset joins fail silently and countries lose data across dimensions.
Key Mappings
| Canonical Name | Semiconductor | Cross-Border | Country AI | PARAT | AGORA |
|---|---|---|---|---|---|
| China | CHN | China | China (mainland) | China | China |
| United States | USA | United States | United States | United States | United States |
| South Korea | KOR | South Korea | South Korea | South Korea | — |
| Taiwan | TWN | Taiwan | Taiwan | Taiwan | — |
| Japan | JPN | Japan | Japan | Japan | — |
| Germany | DEU | Germany | Germany | Germany | — |
| Netherlands | NLD | Netherlands | Netherlands | Netherlands | — |
| United Kingdom | GBR | United Kingdom | United Kingdom | United Kingdom | — |
Harmonization Function
def harmonize_country(name):
    if name in GROUP_ENTITIES: return None  # Exclude 12 group entities
    if name in ISO_TO_NAME: return ISO_TO_NAME[name]  # CHN → China, USA → United States
    if name in VARIANT_TO_CANONICAL: return VARIANT_TO_CANONICAL[name]  # "China (mainland)" → China
    if name == 'Various countries': return None  # Semiconductor aggregates
    return name  # Already canonical
Group Entities Excluded (12)
These exist in the Country AI Activity dataset as aggregate rows. Removed before any metric computation.
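A self-contained usage sketch of the harmonization function, restated so the block runs on its own. The lookup tables here are tiny illustrative stand-ins: the two ISO entries and the one variant come from the mappings table, while "EXAMPLE_GROUP" is a placeholder and not one of the actual 12 group entities.

```python
# Illustrative lookup tables; the real ones are assumed to be much larger.
GROUP_ENTITIES = {"EXAMPLE_GROUP"}  # placeholder name, hypothetical
ISO_TO_NAME = {"CHN": "China", "USA": "United States"}
VARIANT_TO_CANONICAL = {"China (mainland)": "China"}


def harmonize_country(name):
    if name in GROUP_ENTITIES:
        return None  # aggregate rows are dropped
    if name in ISO_TO_NAME:
        return ISO_TO_NAME[name]
    if name in VARIANT_TO_CANONICAL:
        return VARIANT_TO_CANONICAL[name]
    if name == "Various countries":
        return None  # semiconductor aggregate
    return name  # already canonical


raw_names = ["CHN", "China (mainland)", "Japan", "Various countries", "EXAMPLE_GROUP"]
# Drop rows that harmonize to None before any join.
canonical = [c for c in map(harmonize_country, raw_names) if c is not None]
```

Filtering out None before joining keeps aggregate rows from appearing as countries in the master datasets.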
AISI Scoring Pipeline
7 dimensions → coverage-weighted composite → final rankings
Dimension Weights (v3)
| Dimension | Weight | Coverage | Sub-components | Data Source |
|---|---|---|---|---|
| Hardware Sovereignty | 25% | ~23 | Avg share (35%) + Peak share (30%) + Breadth/96 (35%) | Semiconductor |
| Research Capacity | 20% | ~192 | Log pub volume (35%) + Lagged citations (45%) + Z-scored growth (20%) | Country AI |
| Commercial Ecosystem | 20% | ~117 | Log investment (50%) + Log company count (30%) + Pub intensity (20%) | Country AI + PARAT |
| Innovation Output | 15% | ~73 | Log patent volume (45%) + Field diversity /11 (55%) | Country AI |
| Talent Base | 10% | 17 | Tiered fallback: Workforce → Publications → Company count | PARAT |
| Collaboration Network | 5% | ~94 | Partner diversity capped at 54 (40%) + Strategic volume (35%) + Partner quality (25%) | Cross-Border |
| Governance Readiness | 5% | 2 | Log enacted docs (30%) + Breadth/77 (40%) + Maturity ratio (30%) | AGORA |
Log Transform Impact
| Metric | Raw Skew | Log Skew | Top/Median (Raw) | Top/Median (Log) |
|---|---|---|---|---|
| Publication volume | 8.3 | 0.2 | 1,127× | 2.6× |
| Patent volume | 6.9 | 0.8 | 4,987× | 3.4× |
| Estimated investment | 10.4 | 0.3 | 6,741× | 2.8× |
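The effect in the table above, a heavy-tailed metric collapsing to a small top-to-median ratio after a log transform, can be reproduced on toy numbers. The values below are illustrative and not from the dataset, and log1p is an assumed choice of transform (the table only says "log").

```python
import math
import statistics

# Illustrative heavy-tailed values (not real data).
raw = [5, 40, 300, 2_000, 150_000]
logged = [math.log1p(v) for v in raw]  # log1p handles zeros safely

# Ratio of the top value to the median, before and after the transform.
raw_ratio = max(raw) / statistics.median(raw)
log_ratio = max(logged) / statistics.median(logged)
```

On these numbers the raw ratio is 500 while the logged ratio stays close to 2, so a single dominant country no longer flattens every other score toward zero.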
Coverage-Weighted Composite Formula
1. Composite Base = Σ (dimension_score × dimension_weight × dimension_coverage) / Σ (dimension_weight × dimension_coverage)
2. Final AISI Score = Composite Base × Breadth Multiplier
Where Breadth Multiplier = 0.75 + 0.25 × (dims_scored / 7)
Bug Fixes Applied
Breadth multiplier: added a factor of 0.75 + 0.25 × (dims_scored / 7) that scales the raw score based on data depth.
Final Output Files
AISI_Final_Rankings.csv: all ~195 countries, 7 dimensions + composite
AISI_Final_Rankings_v2_no_governance.csv: 6-dimension variant (governance excluded)
AISI_High_Confidence_Rankings.csv: filtered to ≥50% data coverage only
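The coverage-weighted composite formula and breadth multiplier from the scoring pipeline can be sketched for a single country as below. This is an illustration under assumptions: each dimension is passed as a (score, weight, coverage) tuple, only dimensions with data are included, and coverage is treated as a plain numeric factor since its exact scale is not specified here.

```python
def aisi_score(dimensions, total_dims=7):
    """dimensions: list of (score, weight, coverage) for scored dimensions."""
    numerator = sum(s * w * c for s, w, c in dimensions)
    denominator = sum(w * c for _s, w, c in dimensions)
    if denominator == 0:
        return 0.0  # no usable dimensions (assumed handling)
    composite_base = numerator / denominator
    # Countries scored on fewer dimensions are scaled down, never below 75%.
    breadth_multiplier = 0.75 + 0.25 * (len(dimensions) / total_dims)
    return composite_base * breadth_multiplier


# Hypothetical country scored on only two of the seven dimensions.
partial = aisi_score([(80.0, 0.25, 1.0), (60.0, 0.20, 0.5)])

# Hypothetical country scoring 50 on all seven equally weighted dimensions.
full = aisi_score([(50.0, 1 / 7, 1.0)] * 7)
```

With all seven dimensions present the multiplier is 1.0 and the score equals the composite base, while the two-dimension country keeps roughly 82% of its base.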