Data Processing

Complete documentation of how 5 ETO datasets containing 31 source files were cleaned, combined, harmonized, and transformed into 8 analysis-ready master datasets feeding the AI Supremacy Index scoring pipeline.

31 Source Files (across 5 datasets)
8 Master Outputs (analysis-ready CSVs)
~50K Total Rows (processed)
195 Countries (after harmonization)
7 Dimensions (scored per country)
End-to-End Pipeline
5 Raw Datasets → 31 CSV Files → Clean & Combine → Harmonize Names → 8 Master Files → 7 Dimensions → AISI Score
1. Advanced Semiconductor Supply Chain

5 source files → 1 master dataset (1,305 rows × 16 columns)

5 Source Files
126 Unique Inputs (tools, materials, processes)
374 Organizations (across 21 countries)
1,305 Master Rows (provider–input pairs)
96 Inputs w/ Share (breadth denominator)

Source Files

File | Rows | Cols | Role
inputs.csv | 126 | 10 | Catalog of chip production inputs (tools 90, materials 17, processes 11, designs 7)
providers.csv | 397 | 5 | Countries + organizations (includes aliases → 374 unique after dedup)
provision.csv | 1,305 | 7 | Core linkage: which providers supply which inputs, with market share %
sequence.csv | 139 | 6 | Supply chain relationships: 53 "goes into" + 86 "is type of"
stages.csv | 3 | 6 | Three production stages: Design, Fabrication, ATP

Processing Pipeline

Step 1 — Deduplicate Providers: providers.csv contains alias rows. Drop duplicates on provider_id (397 → 374 unique). Rename country → provider_hq_country.
Step 2 — Join Provision + Providers: Left join on provider_id, adding provider_type and provider_hq_country.
Step 3 — Join Input Metadata: Left join on provided_id = input_id, adding input_type, stage_id, input_data_year, input_market_size.
Step 4 — Join Stage Metadata: Left join on stage_id, adding production_stage (Design / Fabrication / ATP).
Step 5 — Derive Effective Country: For country-type providers, use provider_name (ISO code). For organizations, use provider_hq_country. This creates a unified country attribution column.
Step 6 — Final Cleanup & Reorder: Drop redundant columns, rename for clarity (provided_name → input_name, year → provision_year). Reorder to 16 analysis-ready columns.
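The six steps above can be sketched in pandas. Everything below is illustrative: the tiny DataFrames stand in for the real CSVs, with column names taken from the source-file table in this section.

```python
import pandas as pd

# Illustrative data standing in for the real CSVs (schema from the tables above).
providers = pd.DataFrame({
    "provider_id": [1, 1, 2],  # rows 0 and 1 simulate an alias duplicate
    "provider_name": ["TSM", "TSMC", "TWN"],
    "provider_type": ["organization", "organization", "country"],
    "country": ["TWN", "TWN", None],
})
provision = pd.DataFrame({
    "provider_id": [1, 2], "provided_id": [10, 10], "share_provided": [55.0, 45.0],
})
inputs_df = pd.DataFrame({"input_id": [10], "input_type": ["tools"], "stage_id": [2]})
stages = pd.DataFrame({"stage_id": [2], "production_stage": ["Fabrication"]})

# Step 1 — deduplicate providers and rename the HQ-country column.
providers = (providers.drop_duplicates("provider_id")
                      .rename(columns={"country": "provider_hq_country"}))

# Steps 2–4 — chained left joins onto the provision linkage table.
master = (provision
          .merge(providers, on="provider_id", how="left")
          .merge(inputs_df, left_on="provided_id", right_on="input_id", how="left")
          .merge(stages, on="stage_id", how="left"))

# Step 5 — unified country attribution: country-type providers carry an ISO
# code in provider_name; organizations fall back to their HQ country.
master["effective_country"] = master["provider_name"].where(
    master["provider_type"] == "country", master["provider_hq_country"])
```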

Quality Checks

Market share sums: Country-level shares per input average ~100% (median 100.0%); no outliers fall outside the 90–110% band.
"Various" entries: Aggregated small providers (<1% share each) — kept as-is, excluded from country-level analysis.
Missing shares: 30 inputs have no market share data → only 96 inputs used for breadth denominator.
HQ Attribution: All country assignments are HQ-based (e.g., TSMC → Taiwan even with US fabs). Flagged in methodology.
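The market-share check reduces to a groupby-sum over the master table; a minimal sketch with illustrative rows (column names follow the master schema above):

```python
import pandas as pd

# Illustrative provision rows: country shares for each input should sum to ~100%.
master = pd.DataFrame({
    "input_id": [10, 10, 10, 11, 11],
    "effective_country": ["TWN", "KOR", "USA", "NLD", "JPN"],
    "share_provided": [55.0, 30.0, 15.0, 80.0, 19.5],
})

# Sum country-level shares per input, then flag inputs outside the 90–110% band.
share_sums = (master.groupby("input_id")["share_provided"]
                    .sum()
                    .rename("total_share"))
outliers = share_sums[(share_sums < 90) | (share_sums > 110)]
```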

Output Schema

semiconductor_master.csv — 1,305 rows × 16 columns
├── provider_name, provider_id, provider_type, provider_hq_country, effective_country
├── input_name, input_id, input_name_full, input_type
├── production_stage, stage_id, input_data_year
├── share_provided, provision_year
└── input_market_size, source
2. Cross-Border Tech Research Collaborations

8 field-specific CSVs → 1 master dataset (21,118 rows × 8 columns)

8 Source Files (one per research field)
21,118 Master Rows (country-pair × field × year)
94 Countries
2015–2025 Year Range (complete through 2023)
≥25 Min Threshold (articles per pair/year)

Source Files (by field)

File | Rows | Field | Note
Artificial_intelligence.csv | 10,569 | AI (general) | Largest — broadest coverage
Computer_vision.csv | 3,335 | Computer Vision | Mature subfield
Chip_design_and_fabrication.csv | 2,418 | Chip Design | Strategically weighted 2.0×
Cybersecurity.csv | 1,709 | Cybersecurity | National security relevance
Robotics.csv | 1,561 | Robotics | Mature subfield
Natural_language_processing.csv | 921 | NLP | Broad applied AI
Large_language_models.csv | 432 | LLMs | Frontier AI, weighted 1.8×
AI_safety.csv | 181 | AI Safety | Smallest — emerging field, weighted 1.6×

Processing Pipeline

Step 1 — Concatenate: Stack all 8 CSVs vertically. Identical schema: country1, country2, field, year, num_articles, complete. Total: 21,118 rows.
Step 2 — Standardize Pair Ordering: Create country_a, country_b with alphabetical ordering so (US, China) and (China, US) map to the same canonical pair. Prevents double-counting in analysis.
Step 3 — Completeness Filtering: Flag rows where complete=True (data considered reliable). Years 2024–2025 are incomplete and excluded from scoring. Complete range: 2015–2023.
Step 4 — Validation: Verify all country names are consistent (uses full names like "United States", "China"). No duplicate pairs per field/year. Minimum 25-article threshold already applied by source.
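Step 2's canonical pair ordering can be sketched as follows (the rows are illustrative; the column names match the schema listed in Step 1):

```python
import pandas as pd

# Illustrative rows: the same US–China pair appears in both orders.
df = pd.DataFrame({
    "country1": ["United States", "China", "China"],
    "country2": ["China", "United States", "Japan"],
    "field": ["AI (general)"] * 3,
    "year": [2023, 2023, 2023],
    "num_articles": [100, 100, 40],
})

# Step 2 — alphabetical ordering maps (US, China) and (China, US) to one
# canonical pair, preventing double-counting downstream.
swap = df["country1"] > df["country2"]
df["country_a"] = df["country1"].where(~swap, df["country2"])
df["country_b"] = df["country2"].where(~swap, df["country1"])

# Canonical pairs can now be deduplicated per field/year.
pairs = df[["country_a", "country_b", "field", "year"]].drop_duplicates()
```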

Key Data Characteristics

US–China axis dominates: 197K co-authored articles — 2.4× the next largest pair.
Growth trajectory: 39K articles in 2015 → 174K in 2023 (+342% overall).
Partner diversity: Mean 19.1 partners, median 12, 95th percentile = 54 (used as cap in AISI).
3. Country AI Activity Metrics

9 CSVs → 4 master datasets (unified + 3 pillar-level)

9 Source Files (3 publication + 2 patent + 4 investment)
189 Countries (after excluding 12 groups)
204 Raw Entities (including OECD, NATO, EU…)
12 Groups Excluded (ASEAN, Five Eyes, Quad…)

Source Files

File | Rows | Pillar | Key Metrics
publications_yearly_articles.csv | 8,983 | Publications | num_articles by country × field × year
publications_yearly_citations.csv | ~8K | Publications | num_citations (no complete flag — lagged)
publications_yearly_highly_cited.csv | ~8K | Publications | Highly cited article counts
patents_yearly_applications.csv | 5,832 | Patents | num_patent_applications (complete through 2020)
patents_yearly_grants.csv | ~5K | Patents | Granted patents (removed from scoring — grant rate bias)
companies_yearly_disclosed.csv | ~12K | Investment | Disclosed investment ($M)
companies_yearly_estimated.csv | 15,339 | Investment | Estimated investment ($M) — used for scoring
companies_yearly_num_transactions.csv | ~12K | Investment | Transaction counts
companies_yearly_num_companies.csv | ~12K | Investment | Active company counts

Processing Pipeline

Step 1 — Exclude Group Entities: Remove 12 aggregate entries: ASEAN, Africa, Asia, EU, Europe, Five Eyes, Global Partnership on AI, NATO, North America and the Caribbean, OECD, Oceania, Quad. These are not countries and distort country-level metrics. 204 → 192 entities.
Step 2 — Harmonize "China (mainland)": Country AI Activity uses "China (mainland)" while all other datasets use "China"; mapped to the canonical "China" via a lookup table. An is_group boolean is also set before the group filtering in Step 1.
Step 3 — Filter on Completeness: Only complete=True rows used for scoring. Publications: 2015–2023. Patents: 2015–2020 (5-year lag!). Investment: 2015–2024. Citations: no complete flag → use lagged window 2021–2023.
Step 4 — Build Pillar-Level Masters: Aggregate by country across complete years. Publications master: avg annual articles, fields covered. Patents master: total applications, field diversity (out of 11 fields). Investment master: total estimated investment ($M).
Step 5 — Build Unified Master: Join all 3 pillars on country name. Set boolean availability flags (has_publications, has_citations, has_patents, has_investment) BEFORE any NaN filling. Output: 189 countries × 10 columns.
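A minimal sketch of Step 5's flag-before-fill rule, with illustrative pillar tables (the country rows and metric column names are stand-ins, not the real pillar schemas):

```python
import pandas as pd

# Illustrative pillar masters keyed on canonical country names.
pubs = pd.DataFrame({"country": ["China", "Kenya"],
                     "avg_annual_articles": [150000.0, 400.0]})
patents = pd.DataFrame({"country": ["China"], "total_applications": [90000.0]})
invest = pd.DataFrame({"country": ["China"],
                       "total_estimated_investment_musd": [50000.0]})

# Step 5 — outer-join the pillars, then set availability flags BEFORE any NaN
# filling, so "no data" stays distinguishable from a genuine zero.
unified = (pubs.merge(patents, on="country", how="outer")
               .merge(invest, on="country", how="outer"))
unified["has_publications"] = unified["avg_annual_articles"].notna()
unified["has_patents"] = unified["total_applications"].notna()
unified["has_investment"] = unified["total_estimated_investment_musd"].notna()
```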

Critical Data Findings

Extreme skewness: Publications skew=8.3, Patents skew=6.9, Investment skew=10.4. Log(1+x) reduces all to <1.0.
Patent lag: Complete patent data ends at 2020 — 5 years behind. Innovation Output measures pre-pandemic activity only.
Coverage varies: 192 countries have publications, but only 73 have patents and 117 have investment data. 65 countries have all 3 pillars.
Citation gap: No complete flag for citations. Using lagged window (2021–2023) to avoid measuring data pipeline latency.

Output Files

country_ai_unified_master.csv     — 189 rows × 10 cols  (one row per country, all pillars)
country_ai_publications_master.csv — 8,983 rows          (yearly articles by country × field)
country_ai_patents_master.csv      — 5,832 rows          (yearly applications by country × field)
country_ai_investment_master.csv   — 15,339 rows         (yearly estimated investment by country × field)
4. Private-Sector AI Indicators (PARAT)

5 source files → parat_master.csv + parat_country_agg.csv

691 Companies (startups to multinationals)
17 Countries (85% US-based)
63 Core Columns (publications, patents, workforce)
71 Master Columns (after enrichment)

Source Files

File | Rows | Role
core.csv | 691 | Main metrics: AI pubs, patents, workforce, company metadata (HQ, stage, sector)
yearly_publication_counts.csv | ~6K | Disaggregated yearly publication & patent data per company
alias.csv | varies | Alternate company names for matching
ticker.csv | varies | Stock exchange symbols for public companies
id.csv | varies | Cross-references: LinkedIn, Crunchbase, ROR, PermID

Processing Pipeline

Step 1 — Load & Merge: Join core.csv with yearly_publication_counts.csv on company ID. Add alias and ticker for enrichment. Result: 691 companies × 71 columns.
Step 2 — Flag Key Attributes: Identify company categories — S&P 500, Global Big Tech, GenAI Contenders. Flag companies with zero AI publications (35%) and zero AI patents (38%).
Step 3 — Country-Level Aggregation: Group by HQ country. Compute: company_count, total_ai_pubs, total_ai_patents, total_ai_workers, total_tech_workers, avg_ai_pubs_per_company, top_conference_pubs, sp500_count, big_tech_count, genai_count.
Step 4 — Talent Tier Assignment: For each country, count companies with non-zero Workforce: AI workers. Tier 1: ≥3 companies with workforce data. Tier 2: has AI publications. Tier 3: company count only.
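The tier assignment in Step 4 can be sketched as follows (the company rows are illustrative, and `ai_workers`/`ai_pubs` are simplified stand-ins for the real PARAT column names such as "Workforce: AI workers"):

```python
import pandas as pd

# Illustrative PARAT company rows grouped by HQ country.
parat = pd.DataFrame({
    "hq_country": ["United States"] * 3 + ["France"] * 2 + ["Chile"],
    "ai_workers": [1200.0, 300.0, 50.0, None, None, None],
    "ai_pubs": [500, 20, 0, 15, 0, 0],
})

def talent_tier(group: pd.DataFrame) -> int:
    """Tier 1: >=3 companies with workforce data; Tier 2: any AI pubs; Tier 3: count only."""
    if group["ai_workers"].notna().sum() >= 3:
        return 1
    if group["ai_pubs"].sum() > 0:
        return 2
    return 3

tiers = {country: talent_tier(g) for country, g in parat.groupby("hq_country")}
```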

Known Limitations

US dominance: 85% of companies are US-based. Workforce data (LinkedIn-derived) is essentially US-only — LinkedIn blocked in China/Russia.
Curated, not exhaustive: PARAT is a selected set of 691 companies, not a census. Biased toward well-known firms.
Maturity skew: 93% of companies are at "mature" stage. Startup representation is limited.
5. AGORA — AI Governance Archive

4 source files → agora_master.csv with NLP stance classification

973 Documents (laws, regulations, standards)
77 Taxonomy Tags (5 domains)
91% US-Focused (882 of 973 documents)
2 Scoreable Countries: US (390 enacted) + China (18)

Source Files

File | Rows | Role
documents.csv | 973 | Core metadata: authority, status, dates, summaries, 77 tag columns
segments.csv | 8,116 | Sub-document segments with granular annotations
authorities.csv | 105 | Issuing bodies with jurisdiction and parent authority
collections.csv | 10 | Thematic groupings

Processing Pipeline

Step 1 — Merge Documents + Authorities: Left join documents on Authority = Name from authorities table. Adds Jurisdiction field for country mapping.
Step 2 — Map to Countries: Sub-national authorities (California, NYC) → "United States". Exclude "Multinational" and "Other countries" (cannot be mapped to individual nations). Only US and China are individually scoreable.
Step 3 — NLP Stance Classification: Classify each document as Enabling, Restrictive, Neutral, or Balanced using a dual-signal approach:
Tag-based scoring (2× weight): Map all 77 taxonomy tags to stance indicators. Sum pro-enabling vs pro-restrictive tag counts.
Keyword analysis (1× weight): Scan document summaries for stance-indicating keywords (e.g., "promote", "fund" → enabling; "restrict", "prohibit" → restrictive).
Result: 346 Enabling, 293 Restrictive, 191 Neutral, 137 Balanced, 6 Unclassified.
Step 4 — Boolean Tag Normalization: Convert all 77 tag columns to proper booleans (handling mixed TRUE/"True"/1 formats). Required for thematic breadth calculation.
Step 5 — Governance Metrics: Per country: enacted_docs (log-transformed), thematic_breadth (unique tags covered / 77), maturity_ratio (enacted / total docs).
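A minimal sketch of the dual-signal classifier in Step 3. The tag names, keyword lists, and tie-breaking rule shown here are illustrative assumptions; the real pipeline maps all 77 taxonomy tags and a fuller keyword lexicon.

```python
# Illustrative stance lexicons (the real mapping covers all 77 taxonomy tags).
ENABLING_TAGS = {"R&D funding", "Workforce development"}
RESTRICTIVE_TAGS = {"Prohibitions", "Enforcement"}
ENABLING_KEYWORDS = ("promote", "fund", "invest")
RESTRICTIVE_KEYWORDS = ("restrict", "prohibit", "ban")

def classify_stance(tags: set, summary: str) -> str:
    # Tag-based signal weighted 2x; keyword signal from the summary weighted 1x.
    score = 2 * (len(tags & ENABLING_TAGS) - len(tags & RESTRICTIVE_TAGS))
    text = summary.lower()
    score += sum(kw in text for kw in ENABLING_KEYWORDS)
    score -= sum(kw in text for kw in RESTRICTIVE_KEYWORDS)
    if score > 0:
        return "Enabling"
    if score < 0:
        return "Restrictive"
    # Tie: "Balanced" means offsetting signals, "Neutral" means no signal at all.
    has_signal = bool(tags & (ENABLING_TAGS | RESTRICTIVE_TAGS))
    return "Balanced" if has_signal else "Neutral"
```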

Document Status Breakdown

450 Enacted — active law/regulation
400 Defunct — expired or superseded
123 Proposed — pending legislation
6. Country Name Harmonization

Unifying naming conventions across all 5 datasets into canonical forms

Each dataset uses different country naming conventions — ISO 3166 codes (semiconductor), full names with variants ("China (mainland)" in Country AI Activity), and standard full names (cross-border, PARAT, AGORA). Without explicit harmonization, cross-dataset joins fail silently and countries lose data across dimensions.

Key Mappings

Canonical Name | Semiconductor | Cross-Border | Country AI | PARAT | AGORA
China | CHN | China | China (mainland) | China | China
United States | USA | United States | United States | United States | United States
South Korea | KOR | South Korea | South Korea | South Korea | –
Taiwan | TWN | Taiwan | Taiwan | Taiwan | –
Japan | JPN | Japan | Japan | Japan | –
Germany | DEU | Germany | Germany | Germany | –
Netherlands | NLD | Netherlands | Netherlands | Netherlands | –
United Kingdom | GBR | United Kingdom | United Kingdom | United Kingdom | –

Harmonization Function

# Illustrative subsets of the full lookup tables.
GROUP_ENTITIES = {"ASEAN", "EU", "NATO", "OECD"}          # 12 entries in full
ISO_TO_NAME = {"CHN": "China", "USA": "United States"}    # semiconductor ISO codes
VARIANT_TO_CANONICAL = {"China (mainland)": "China"}      # naming variants

def harmonize_country(name):
    if name in GROUP_ENTITIES:       return None                         # Exclude 12 group entities
    if name in ISO_TO_NAME:          return ISO_TO_NAME[name]            # CHN → China, USA → United States
    if name in VARIANT_TO_CANONICAL: return VARIANT_TO_CANONICAL[name]   # "China (mainland)" → China
    if name == 'Various countries':  return None                         # Semiconductor aggregates
    return name                                                          # Already canonical

Group Entities Excluded (12)

ASEAN, Africa, Asia, EU, Europe, Five Eyes, GPAI, NATO, N. America & Caribbean, OECD, Oceania, Quad

These exist in the Country AI Activity dataset as aggregate rows. Removed before any metric computation.

7. AISI Scoring Pipeline

7 dimensions → coverage-weighted composite → final rankings

Dimension Weights (v3)

Dimension | Weight | Coverage (countries) | Sub-components | Data Source
Hardware Sovereignty | 25% | ~23 | Avg share (35%) + Peak share (30%) + Breadth/96 (35%) | Semiconductor
Research Capacity | 20% | ~192 | Log pub volume (35%) + Lagged citations (45%) + Z-scored growth (20%) | Country AI
Commercial Ecosystem | 20% | ~117 | Log investment (50%) + Log company count (30%) + Pub intensity (20%) | Country AI + PARAT
Innovation Output | 15% | ~73 | Log patent volume (45%) + Field diversity /11 (55%) | Country AI
Talent Base | 10% | 17 | Tiered fallback: Workforce → Publications → Company count | PARAT
Collaboration Network | 5% | ~94 | Partner diversity capped @54 (40%) + Strategic volume (35%) + Partner quality (25%) | Cross-Border
Governance Readiness | 5% | 2 | Log enacted docs (30%) + Breadth/77 (40%) + Maturity ratio (30%) | AGORA

Log Transform Impact

Metric | Raw Skew | Log Skew | Top/Median (Raw) | Top/Median (Log)
Publication volume | 8.3 | 0.2 | 1,127× | 2.6×
Patent volume | 6.9 | 0.8 | 4,987× | 3.4×
Estimated investment | 10.4 | 0.3 | 6,741× | 2.8×
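The mechanism behind these numbers can be demonstrated on synthetic data. The sketch below generates an illustrative heavy-tailed sample; the tabulated skew and ratio values come from the actual country-level data, not from this code.

```python
import numpy as np

# Illustrative heavy-tailed metric (e.g. publication volume across ~189 countries).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=4.0, sigma=2.5, size=189)
logged = np.log1p(raw)  # the log(1+x) transform used in scoring

def sample_skew(x: np.ndarray) -> float:
    """Standardized third moment of a sample."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())
```

Both the skewness and the top-to-median ratio collapse after the transform, which is what keeps a single outlier country from dominating a dimension score.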

Coverage-Weighted Composite Formula

1. Composite Base = Σ (dimension_score × dimension_weight × dimension_coverage)
                    ─────────────────────────────────────────────────────────
                               Σ (dimension_weight × dimension_coverage)

2. Final AISI Score = Composite Base × Breadth Multiplier

Where Breadth Multiplier = 0.75 + 0.25 × (dims_scored / 7)

Intuition: A country's score is first calculated as the weighted average of its available dimension data. This Composite Base is then adjusted by a Breadth Multiplier (ranging from 0.75 to 1.0) based on how many of the 7 dimensions have data. A country with data in only 1 of 7 dimensions loses ~21% of its base score, whereas a country with full 7/7 coverage keeps 100%.
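The two formulas can be written out directly. Weights come from the dimension table above; the coverage fractions and the per-country scores in the example are illustrative assumptions.

```python
# Dimension weights from the table above; coverage expressed as a fraction of
# the ~195 harmonized countries (illustrative, from the table's coverage column).
WEIGHTS = {"hardware": 0.25, "research": 0.20, "commercial": 0.20,
           "innovation": 0.15, "talent": 0.10, "collab": 0.05, "governance": 0.05}
COVERAGE = {"hardware": 23/195, "research": 192/195, "commercial": 117/195,
            "innovation": 73/195, "talent": 17/195, "collab": 94/195,
            "governance": 2/195}

def aisi_score(dim_scores: dict) -> float:
    """dim_scores maps dimension -> score (0-100); missing dimensions are omitted."""
    # 1. Coverage-weighted composite base over the available dimensions.
    num = sum(s * WEIGHTS[d] * COVERAGE[d] for d, s in dim_scores.items())
    den = sum(WEIGHTS[d] * COVERAGE[d] for d in dim_scores)
    base = num / den
    # 2. Breadth multiplier penalizes sparse dimension coverage (Bug #2 fix).
    breadth = 0.75 + 0.25 * (len(dim_scores) / 7)
    return base * breadth
```

With equal scores across all 7 dimensions the multiplier is 1.0 and the composite equals the dimension score; a single-dimension profile with the same score lands noticeably lower.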

Bug Fixes Applied

Bug #1 — NaN ≠ Zero: Countries absent from semiconductor/collaboration/governance datasets were treated as NaN (missing), not zero (genuine absence). The coverage-weighted formula rewarded them with higher scores for having fewer dimensions. Fix: Fill genuine-zero dimensions with 0 before scoring.
Bug #2 — Breadth Gaming: Countries scoring on only 2–3 dimensions could outscore broader profiles because the coverage denominator was smaller. Fix: Added a Breadth Multiplier 0.75 + 0.25 × (dims/7) that scales the raw score based on data depth.

Final Output Files

AISI_Final_Rankings.csv             — All ~195 countries, 7 dimensions + composite
AISI_Final_Rankings_v2_no_governance.csv — 6-dimension variant (governance excluded)
AISI_High_Confidence_Rankings.csv   — Filtered to ≥50% data coverage only