WIMS Entity Mapper

GLEIF Golden Copy + Wikidata Integration: Results Report  ·  2026-05-31  ·  GLEIF 3.07 M records  ·  Wikidata 272 companies  ·  LEI similarity guard applied

Dry-Run Comparison — 1 000 Trademark Holders

Metric                    Baseline (no enrichment)  After GLEIF + Wikidata *
Holders matched           559 / 1 000 (55.9 %)      665 / 1 000 (66.5 %)  ▲ +106 (+19 %)
No-match                  441 (44.1 %)              335 (33.5 %)  ▼ −106 (−24 %)
Via AI (of matched)       475 (85 %)                314 (47 %)  ▼ −161 (−34 %)
Via LEI / GLEIF           0                         43 unique holders, avg conf 0.865
Processing speed          17 / min                  ~35 / min  ▲ 2.1×
Time for 1 000 holders    58.2 min                  28.9 min  ▼ −29 min

* "After" figures are from a re-run with the LEI similarity guard (≥ 65 % token_sort_ratio) already applied. An intermediate unfiltered run showed 133 raw LEI hits with a 68 % false-positive rate; the guard reduced this to 56 match rows (43 unique holders), preserving precision.
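
The comparison mixes two kinds of change: relative change (%) and absolute change in percentage points (pp). The following is a minimal sketch of that arithmetic using the counts reported above; the function names are illustrative and not taken from the project.

```javascript
// Relative change in percent: how much `after` differs from `before`.
function relativeChangePct(before, after) {
  return ((after - before) / before) * 100;
}

// Absolute change in percentage points for counts out of a fixed sample size.
function percentagePointChange(beforeCount, afterCount, sampleSize) {
  return ((afterCount - beforeCount) / sampleSize) * 100;
}

const SAMPLE_SIZE = 1000; // holders in the dry run

console.log(relativeChangePct(559, 665).toFixed(0));                  // "19"   holders matched
console.log(relativeChangePct(441, 335).toFixed(0));                  // "-24"  no-match
console.log(relativeChangePct(475, 314).toFixed(0));                  // "-34"  holders resolved via AI
console.log(percentagePointChange(559, 665, SAMPLE_SIZE).toFixed(1)); // "10.6" match rate, in pp
```

The match rate therefore rises by 10.6 pp, which is a 19 % relative improvement over the baseline of 559 matched holders.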

Match Method Breakdown — Fixed Run

Bar width = share of total match rows  (1 153 rows for 665 unique holders).

  • exact: 502 rows, avg 0.997
  • ai: 477 rows, avg 0.795, 314 unique holders
  • fuzzy: 87 rows, avg 0.784
  • lei ✦: 56 rows, avg 0.865, 43 unique holders
  • dba: 31 rows, avg 0.965
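
As a quick check on the bar widths, each method's share of the 1 153 match rows can be computed from the row counts reported for the fixed run. This is only an illustrative sketch; the `shareByMethod` helper is hypothetical and not part of the project.

```javascript
// Row counts per match method, as reported for the fixed run.
const rowsByMethod = { exact: 502, ai: 477, fuzzy: 87, lei: 56, dba: 31 };

// Returns each method's share of all match rows, in percent.
function shareByMethod(counts) {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  return Object.fromEntries(
    Object.entries(counts).map(([method, n]) => [method, (100 * n) / total])
  );
}

const shares = shareByMethod(rowsByMethod);
for (const [method, pct] of Object.entries(shares)) {
  console.log(`${method}: ${pct.toFixed(1)} %`);
}
```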

Baseline (no enrichment): exact 407 · ai 763 rows · fuzzy 79 · dba 25 · lei 0 — 559 matched, 85 % via AI.

LEI Matching — Quality Analysis & Fix

⚠ False Positives Detected — v1 LEI Matching

The initial LEI FTS bridge produced 133 match rows (673 unique holders matched in that run), but analysis showed that 90 of those rows (68 %) were false positives. Generic industry tokens such as vape, electronic, and cigarettes caused the FTS to surface GLEIF records for unrelated companies that merely shared a keyword. Example: "180 Vape, LLC" was bridged to "Ziizai" at 14 % name similarity through a GLEIF entity that happened to contain "vape".

Fix applied in src/match/lei.js: a GLEIF record is accepted as a bridge only when token_sort_ratio(holder_name, gleif_legal_name) ≥ 65 %. This drops cross-industry noise while preserving legitimate subsidiary resolutions such as "Canopy Growth USA, LLC" → "Canopy Growth Corp." (93 %) and "NICOVENTURES RETAIL (UK) LIMITED" → "Nicoventures Holdings, Ltd." (71 %).
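
The guard can be sketched as follows. This is not the code in src/match/lei.js: to stay dependency-free it uses a simplified stand-in for fuzzball's token_sort_ratio (sort the tokens, then score by plain Levenshtein distance), so its scores will generally differ from the percentages quoted above. The names `normalise`, `tokenSortRatio`, and `acceptLeiBridge` are assumptions made for illustration.

```javascript
// Lower-case, strip punctuation, and sort tokens so word order does not matter.
function normalise(name) {
  return name
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, ' ')
    .split(' ')
    .filter(Boolean)
    .sort()
    .join(' ');
}

// Classic two-row Levenshtein edit distance (insert, delete, substitute = 1).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Simplified stand-in for token_sort_ratio: 0..100, higher is more similar.
// Assumption: the real scorer may normalise and weight edits differently.
function tokenSortRatio(a, b) {
  const x = normalise(a);
  const y = normalise(b);
  const longest = Math.max(x.length, y.length);
  if (longest === 0) return 0;
  return Math.round(100 * (1 - levenshtein(x, y) / longest));
}

const MIN_SIMILARITY = 65; // threshold stated in the report

// A GLEIF record may act as a bridge only if its legal name is close enough.
function acceptLeiBridge(holderName, gleifLegalName) {
  return tokenSortRatio(holderName, gleifLegalName) >= MIN_SIMILARITY;
}

console.log(acceptLeiBridge('180 Vape, LLC', 'Ziizai'));                   // false
console.log(acceptLeiBridge('Growth Canopy Corp', 'Canopy Growth Corp.')); // true
```

Because the check compares whole names rather than individual keywords, a shared token such as "vape" is not enough on its own to clear the threshold.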

  • Raw LEI matches (v1): 133 rows, 68 % false-positive rate, no similarity guard
  • Filtered LEI matches (fixed): 56 rows, 43 unique holders, ≥ 65 % name similarity required
  • LEI avg confidence: 0.865, higher than AI (0.795); deterministic, authoritative source

Data Enrichment Summary

GLEIF Golden Copy

Concatenated File XML · 2026-05-30 · 7.6 GB · streamed in 16 min
  • 3,075,680 ACTIVE legal entities loaded
  • 2,804,580 with registration numbers (91 %)
  • 487,104 with alternative / transliterated names
  • FTS5 index with unicode61 tokeniser
  • Top countries: IN · US · DE · IT · GB · ES · NL · FR
  • Streaming XML parser — no extra npm dependencies
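
One dependency-free way to stream a file of this size is to cut the incoming text into complete record elements and pull out the few fields needed. The sketch below illustrates that idea only; it is not the project's parser. It assumes the element names of the GLEIF LEI-CDF format (lei:LEIRecord, lei:LEI, lei:LegalName, lei:EntityStatus), assumes the record's opening tag carries no attributes, and does not decode XML entities. `createRecordSplitter` and `textOf` are hypothetical helpers.

```javascript
// Returns a push(chunk) function that calls onRecord for every complete record.
function createRecordSplitter(onRecord, tag = 'lei:LEIRecord') {
  const open = `<${tag}>`;
  const close = `</${tag}>`;
  let buffer = '';
  return function push(chunk) {
    buffer += chunk;
    for (;;) {
      const start = buffer.indexOf(open);
      if (start === -1) {
        // Keep a short tail in case the opening tag is split across chunks.
        buffer = buffer.slice(-(open.length - 1));
        return;
      }
      const end = buffer.indexOf(close, start);
      if (end === -1) {
        buffer = buffer.slice(start); // record incomplete, wait for more data
        return;
      }
      onRecord(buffer.slice(start, end + close.length));
      buffer = buffer.slice(end + close.length);
    }
  };
}

// Text content of the first simple <tag>…</tag> element, or null if absent.
function textOf(xml, tag) {
  const m = xml.match(new RegExp(`<${tag}(?:\\s[^>]*)?>([^<]*)</${tag}>`));
  return m ? m[1].trim() : null;
}

// Usage: keep only ACTIVE entities, as the loader described above does.
const active = [];
const push = createRecordSplitter((rec) => {
  if (textOf(rec, 'lei:EntityStatus') === 'ACTIVE') {
    active.push({ lei: textOf(rec, 'lei:LEI'), legalName: textOf(rec, 'lei:LegalName') });
  }
});

const sample =
  '<lei:LEIRecords>' +
  '<lei:LEIRecord><lei:LEI>AAAA</lei:LEI><lei:Entity>' +
  '<lei:LegalName xml:lang="en">Acme Ltd</lei:LegalName>' +
  '<lei:EntityStatus>ACTIVE</lei:EntityStatus></lei:Entity></lei:LEIRecord>' +
  '<lei:LEIRecord><lei:LEI>BBBB</lei:LEI><lei:Entity>' +
  '<lei:LegalName xml:lang="en">Old Co</lei:LegalName>' +
  '<lei:EntityStatus>INACTIVE</lei:EntityStatus></lei:Entity></lei:LEIRecord>' +
  '</lei:LEIRecords>';

// Feed the sample in 7-character chunks to exercise the boundary handling.
for (let i = 0; i < sample.length; i += 7) push(sample.slice(i, i + 7));
console.log(active.length); // 1
```

In a real run the chunks would come from a file read stream (for example `fs.createReadStream` with a text encoding) rather than from a string.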

Wikidata Enrichment

wbsearchentities + wbgetentities API · 101.8 min · 3 workers
  • 13,762 companies searched
  • 272 Wikidata QID matches (2 % hit rate)
  • English skos:altLabels merged into company aliases
  • LEI codes (P1320) collected where available
  • Country codes resolved via P17 → P297 claim chain
  • Sentinel rows prevent re-querying on future runs
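
The extraction step can be illustrated against the JSON shape returned by wbgetentities. The sketch below is an assumption-laden illustration, not the project's code: `searchUrl`, `claimValues`, `extractCompanyFacts`, and `countryCodeOf` are hypothetical names, the entities are hand-written fixtures rather than live API responses, and no network call is made.

```javascript
// Builds a wbsearchentities request URL for a company name.
function searchUrl(name) {
  const params = new URLSearchParams({
    action: 'wbsearchentities',
    search: name,
    language: 'en',
    type: 'item',
    format: 'json',
  });
  return `https://www.wikidata.org/w/api.php?${params}`;
}

// All main-snak values for one property on an entity (empty if none).
function claimValues(entity, pid) {
  return (entity.claims?.[pid] ?? [])
    .map((claim) => claim.mainsnak?.datavalue?.value)
    .filter((value) => value !== undefined);
}

// First hop: aliases, LEI codes (P1320), and country items (P17).
function extractCompanyFacts(entity) {
  return {
    qid: entity.id,
    aliases: (entity.aliases?.en ?? []).map((a) => a.value),
    leiCodes: claimValues(entity, 'P1320'),
    countryQids: claimValues(entity, 'P17').map((v) => v.id),
  };
}

// Second hop: the country item's ISO 3166-1 alpha-2 code (P297), or null.
function countryCodeOf(countryEntity) {
  return claimValues(countryEntity, 'P297')[0] ?? null;
}

// Hand-written fixtures with placeholder values.
const company = {
  id: 'Q1',
  aliases: { en: [{ language: 'en', value: 'Acme' }] },
  claims: {
    P1320: [{ mainsnak: { datavalue: { value: 'LEI-PLACEHOLDER-0001' } } }],
    P17: [{ mainsnak: { datavalue: { value: { id: 'Q30' } } } }],
  },
};
const country = { id: 'Q30', claims: { P297: [{ mainsnak: { datavalue: { value: 'US' } } }] } };

const facts = extractCompanyFacts(company);
console.log(facts.countryQids[0], countryCodeOf(country)); // Q30 US
```

Resolving the country takes a second lookup because P17 points at another Wikidata item, and the two-letter code lives on that item as P297.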

  • Unique tokens in company index: 12,938 (+312 vs baseline 12,626), from Wikidata + GLEIF alias merging
  • Companies enriched: 906 gained extra aliases (6.6 % of 13,762 total companies)
  • GLEIF records in local DB: 3.07 M (full golden copy, ACTIVE only); enables offline, zero-latency LEI lookup
  • AI calls per 1 000 holders: −161 (475 → 314, −34 %), resolved by deterministic steps instead

Updated Match Pipeline

Exact name → DBA / alias → LEI (GLEIF 3 M) ✦ new → Fuzzy (FTS + fuzzball) → AI (Ollama llama3.1)

The LEI step performs a two-hop resolution: holder name → GLEIF FTS → canonical LEI name → company index. The similarity guard (≥ 65 % token_sort_ratio) prevents generic industry terms from creating false bridges. Both match:tm and match:patents pipelines include this step. When the local GLEIF table is empty the step is a near-zero-cost no-op.
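
The two-hop resolution described above can be sketched with the data sources injected as functions. Everything named here is hypothetical (`makeLeiStep`, `gleifCount`, `searchGleif`, `lookupCompany`, `similarity`); the real step reads from the local GLEIF table and the company index, and the stubs below exist only to make the sketch runnable.

```javascript
// Builds the LEI step from injected dependencies (all hypothetical):
//   gleifCount()        -> number of GLEIF rows in the local DB
//   searchGleif(name)   -> candidate records [{ lei, legalName }] from the FTS
//   lookupCompany(name) -> company from the index, or undefined
//   similarity(a, b)    -> 0..100 score, e.g. a token_sort_ratio
function makeLeiStep({ gleifCount, searchGleif, lookupCompany, similarity, minSimilarity = 65 }) {
  return function leiStep(holderName) {
    if (gleifCount() === 0) return []; // empty table: cheap no-op
    const matches = [];
    for (const record of searchGleif(holderName)) {           // hop 1: name -> GLEIF
      const score = similarity(holderName, record.legalName);
      if (score < minSimilarity) continue;                     // similarity guard
      const company = lookupCompany(record.legalName);         // hop 2: legal name -> index
      if (company) {
        matches.push({ method: 'lei', lei: record.lei, companyId: company.id, similarity: score });
      }
    }
    return matches;
  };
}

// Stubbed dependencies; the scores are the ones quoted in the report.
const gleifRows = [
  { lei: 'LEI-A', legalName: 'Canopy Growth Corp.' },
  { lei: 'LEI-B', legalName: 'Ziizai' },
];
const scores = { 'Canopy Growth Corp.': 93, Ziizai: 14 };
const companyIndex = new Map([['Canopy Growth Corp.', { id: 7 }], ['Ziizai', { id: 8 }]]);

const leiStep = makeLeiStep({
  gleifCount: () => gleifRows.length,
  searchGleif: () => gleifRows,
  lookupCompany: (name) => companyIndex.get(name),
  similarity: (_holder, legalName) => scores[legalName],
});

console.log(leiStep('Canopy Growth USA, LLC'));
// [ { method: 'lei', lei: 'LEI-A', companyId: 7, similarity: 93 } ]
```

Injecting the lookups keeps the guard logic testable without a database, and the early return mirrors the no-op behaviour when no GLEIF data has been loaded.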

Full-Pipeline Throughput Projection (3.5 M Holders)

  • Before (baseline): ~143 days at 17 holders / min; 85 % of matches bottlenecked by AI
  • After (GLEIF + Wikidata): ~69 days at ~35 holders / min; 47 % via AI, 43 LEI per 1 000, 2.1× faster
  • GPU time saved: ~74 days at same throughput; proportional AI cost reduction
  • Match rate improvement: 55.9 % → 66.5 %, +10.6 pp (+19 %), extrapolated to full holder set
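
The projection follows directly from the measured rates: days = holders / (holders per minute) / 1 440. A minimal sketch of that arithmetic, assuming continuous processing at a constant rate (the `projectDays` helper is illustrative only):

```javascript
const MINUTES_PER_DAY = 24 * 60;

// Days needed to process `holders` at a constant rate of `perMinute`.
function projectDays(holders, perMinute) {
  return holders / perMinute / MINUTES_PER_DAY;
}

const HOLDERS = 3_500_000;
const before = projectDays(HOLDERS, 17); // baseline rate
const after = projectDays(HOLDERS, 35);  // rate with GLEIF + Wikidata

console.log(Math.round(before));         // 143
console.log(Math.round(after));          // 69
console.log(Math.round(before - after)); // 74
```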