GLEIF Golden Copy + Wikidata Integration — Results Report · 2026-05-31
GLEIF 3.07 M records · Wikidata 272 companies · LEI similarity guard applied
Dry-Run Comparison — 1 000 Trademark Holders
Metric | Baseline (no enrichment) | After GLEIF + Wikidata
Holders matched | 559 / 1 000 (55.9 %) | 665 / 1 000 ▲ +106 (+19 %)
No-match | 441 (44.1 %) | 335 ▼ −106 (−24 %)
Via AI (of matched) | 475 (85 %) | 314 (47 %) ▼ −161 (−34 %)
Via LEI / GLEIF | 0 | 43 unique holders, avg conf 0.865
Processing speed | 17 / min | ~35 / min ▲ 2.1×
Time for 1 000 holders | 58.2 min | 28.9 min ▼ −29 min
* "After" figures are from a re-run with the LEI similarity guard (≥ 65 % token_sort_ratio) already applied.
An intermediate unfiltered run showed 133 raw LEI hits with a 68 % false-positive rate; the guard reduced this to 56 match rows
(43 unique holders), preserving precision.
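The relative changes quoted in the comparison can be re-derived from the raw counts. The snippet below is only a cross-check of that arithmetic; the helper name `pctChange` is illustrative and not part of the project.

```javascript
// Percentage change from `before` to `after`, rounded to a whole percent.
function pctChange(before, after) {
  return Math.round(((after - before) / before) * 100);
}

const report = {
  matchedChange: pctChange(559, 665),           // 19  (holders matched)
  noMatchChange: pctChange(441, 335),           // -24 (holders without a match)
  aiChange: pctChange(475, 314),                // -34 (holders resolved via AI)
  aiShareBefore: Math.round((475 / 559) * 100), // 85  (% of matched, baseline)
  aiShareAfter: Math.round((314 / 665) * 100),  // 47  (% of matched, after)
  speedup: (35 / 17).toFixed(1),                // "2.1"
};

console.log(report);
```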
Match Method Breakdown — Fixed Run
Shares are of total match rows (1 153 rows for 665 unique holders).
exact: 502 rows (43.5 %) · avg conf 0.997
ai: 477 rows (41.4 %) · avg conf 0.795 · 314 unique holders
fuzzy: 87 rows (7.5 %) · avg conf 0.784
lei ✦: 56 rows (4.9 %) · avg conf 0.865 · 43 unique holders
dba: 31 rows (2.7 %) · avg conf 0.965
Baseline (no enrichment): exact 407 · ai 763 rows · fuzzy 79 · dba 25 · lei 0 — 559 matched, 85 % via AI.
LEI Matching — Quality Analysis & Fix
⚠ False Positives Detected — v1 LEI Matching
The initial LEI FTS bridge produced 133 match rows (673 unique holders matched overall), but
analysis showed that 90 of those rows were false positives (68 %).
Generic industry tokens — vape, electronic, cigarettes —
caused the FTS to surface GLEIF records for unrelated companies sharing a keyword.
Example: "180 Vape, LLC" was bridged to "Ziizai" at 14 % name similarity
through a GLEIF entity that happened to contain "vape".
Fix applied — src/match/lei.js:
a GLEIF record is only accepted as a bridge when
token_sort_ratio(holder_name, gleif_legal_name) ≥ 65 %.
This drops cross-industry noise while preserving legitimate subsidiary resolutions
such as "Canopy Growth USA, LLC" → "Canopy Growth Corp."
(93 %) and "NICOVENTURES RETAIL (UK) LIMITED" → "Nicoventures Holdings, Ltd."
(71 %).
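The guard described above can be sketched as follows. This is not the code in src/match/lei.js: the report indicates the project scores names with fuzzball's token_sort_ratio, whereas this self-contained stand-in uses a longest-common-subsequence ratio on token-sorted strings, so its scores will not reproduce the percentages quoted here. The names `sortedTokens`, `lcsLength`, `tokenSortRatio` and `acceptLeiBridge` are illustrative.

```javascript
// Lowercase, turn punctuation into spaces, then sort the tokens alphabetically
// so that word order does not affect the comparison.
function sortedTokens(name) {
  return name
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, ' ')
    .split(' ')
    .filter(Boolean)
    .sort()
    .join(' ');
}

// Length of the longest common subsequence of two strings (O(n*m) time).
function lcsLength(a, b) {
  let prev = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = [0];
    for (let j = 1; j <= b.length; j++) {
      curr[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in 0..100: 2 * LCS / (len(s) + len(t)) on the token-sorted strings.
// Stand-in scoring (assumption), not fuzzball's exact algorithm.
function tokenSortRatio(a, b) {
  const s = sortedTokens(a);
  const t = sortedTokens(b);
  const total = s.length + t.length;
  if (total === 0) return 0;
  return Math.round((200 * lcsLength(s, t)) / total);
}

const LEI_MIN_SIMILARITY = 65; // threshold quoted in the report

// A GLEIF record is accepted as a bridge only if the names are similar enough.
function acceptLeiBridge(holderName, gleifLegalName, minScore = LEI_MIN_SIMILARITY) {
  return tokenSortRatio(holderName, gleifLegalName) >= minScore;
}

console.log(acceptLeiBridge('180 Vape, LLC', 'Ziizai'));
console.log(acceptLeiBridge('Growth Canopy Corp', 'Canopy Growth Corp.'));
```

Keeping the threshold as a parameter makes it easy to re-run the false-positive analysis at other cut-offs before settling on a value.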
raw LEI matches (v1): 133 · 68 % false positive rate · no similarity guard
filtered LEI matches (fixed): 56 · 43 unique holders · ≥ 65 % name similarity required
LEI avg confidence: 0.865 · higher than AI (0.795) · deterministic, authoritative source
Data Enrichment Summary
GLEIF Golden Copy
Concatenated File XML · 2026-05-30 · 7.6 GB · streamed in 16 min
3,075,680 ACTIVE legal entities loaded
2,804,580 with registration numbers (91 %)
487,104 with alternative / transliterated names
FTS5 index with unicode61 tokeniser
Top countries: IN · US · DE · IT · GB · ES · NL · FR
Streaming XML parser — no extra npm dependencies
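A dependency-free streaming load of this kind could look roughly like the sketch below: text chunks are buffered, each complete record is cut out as soon as its closing tag arrives, and only the needed fields are extracted. The element names (`lei:LEIRecord`, `lei:LEI`, `lei:LegalName`, `lei:EntityStatus`) are assumed from the public LEI-CDF format, the function names are illustrative, and the LEI values in the sample are made up. The project's actual parser may work differently.

```javascript
// Decode the five predefined XML entities (numeric references are not handled).
function decodeEntities(s) {
  return s
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&');
}

// Text content of the first <tag ...>text</tag> in an XML fragment, or null.
function textOf(xml, tag) {
  const m = xml.match(new RegExp(`<${tag}(?:\\s[^>]*)?>([^<]*)</${tag}>`));
  return m ? decodeEntities(m[1].trim()) : null;
}

// Incremental splitter: feed text chunks, get one callback per complete record.
function createLeiRecordSplitter(onRecord) {
  const OPEN = '<lei:LEIRecord>';
  const CLOSE = '</lei:LEIRecord>';
  let buffer = '';

  return {
    push(chunk) {
      buffer += chunk;
      for (;;) {
        const start = buffer.indexOf(OPEN);
        if (start === -1) {
          // Keep only a short tail that might hold a partial opening tag.
          buffer = buffer.slice(-OPEN.length);
          return;
        }
        const end = buffer.indexOf(CLOSE, start);
        if (end === -1) {
          buffer = buffer.slice(start); // wait for the rest of this record
          return;
        }
        const xml = buffer.slice(start, end + CLOSE.length);
        buffer = buffer.slice(end + CLOSE.length);
        onRecord({
          lei: textOf(xml, 'lei:LEI'),
          legalName: textOf(xml, 'lei:LegalName'),
          status: textOf(xml, 'lei:EntityStatus'),
        });
      }
    },
  };
}

// Demo: keep ACTIVE entities only, feeding the sample in small chunks.
const records = [];
const splitter = createLeiRecordSplitter((r) => {
  if (r.status === 'ACTIVE') records.push(r);
});

const sample =
  '<lei:LEIRecords>' +
  '<lei:LEIRecord><lei:LEI>TESTLEI0000000000001</lei:LEI><lei:Entity>' +
  '<lei:LegalName xml:lang="en">Example &amp; Co</lei:LegalName>' +
  '<lei:EntityStatus>ACTIVE</lei:EntityStatus></lei:Entity></lei:LEIRecord>' +
  '<lei:LEIRecord><lei:LEI>TESTLEI0000000000002</lei:LEI><lei:Entity>' +
  '<lei:LegalName>Gone Ltd</lei:LegalName>' +
  '<lei:EntityStatus>INACTIVE</lei:EntityStatus></lei:Entity></lei:LEIRecord>' +
  '</lei:LEIRecords>';

for (let i = 0; i < sample.length; i += 37) {
  splitter.push(sample.slice(i, i + 37));
}

console.log(records);
```

With a real file, the chunks would come from something like `fs.createReadStream(path, { encoding: 'utf8' })`, so memory use stays bounded by the size of one record rather than the 7.6 GB file.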
Wikidata Enrichment
wbsearchentities + wbgetentities API · 101.8 min · 3 workers
13,762 companies searched
272 Wikidata QID matches (2 % hit rate)
English skos:altLabels merged into company aliases
LEI codes (P1320) collected where available
Country codes resolved via P17 → P297 claim chain
Sentinel rows prevent re-querying on future runs
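The LEI and country extraction listed above can be illustrated on a wbgetentities-shaped response. This sketch assumes the standard Wikidata JSON layout (`claims[P][i].mainsnak.datavalue.value`, `aliases.en[i].value`) and assumes the country entity has already been fetched into the same map. The function names, the company QID and the LEI value are made up for the example.

```javascript
// First value of a property's claims, or null (rank and qualifiers are ignored).
function firstClaimValue(entity, property) {
  const claim = entity?.claims?.[property]?.[0];
  return claim?.mainsnak?.datavalue?.value ?? null;
}

// Collect LEI (P1320), ISO country code (P17 -> P297) and English aliases.
function extractEnrichment(qid, entities) {
  const entity = entities[qid];
  if (!entity) return null;

  const lei = firstClaimValue(entity, 'P1320');
  // P17 holds an entity reference; its `id` is the country's QID.
  const countryQid = firstClaimValue(entity, 'P17')?.id ?? null;
  // Second hop: read the ISO 3166-1 alpha-2 code from the country entity.
  const countryCode = countryQid
    ? firstClaimValue(entities[countryQid], 'P297')
    : null;
  const aliases = (entity.aliases?.en ?? []).map((a) => a.value);

  return { qid, lei, countryCode, aliases };
}

// Hand-written fixture in the shape returned by wbgetentities.
const entities = {
  Q_EXAMPLE: {
    aliases: { en: [{ language: 'en', value: 'Example Co' }] },
    claims: {
      P1320: [{ mainsnak: { datavalue: { value: 'TESTLEI0000000000001' } } }],
      P17: [{ mainsnak: { datavalue: { value: { id: 'Q30' } } } }],
    },
  },
  Q30: {
    claims: { P297: [{ mainsnak: { datavalue: { value: 'US' } } }] },
  },
};

const enrichment = extractEnrichment('Q_EXAMPLE', entities);
console.log(enrichment);
```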
unique tokens in company index: 12,938 · +312 vs baseline (12,626) · from Wikidata + GLEIF alias merging
companies enriched: 906 · gained extra aliases (6.6 %) · of 13,762 total companies
GLEIF records in local DB: 3.07 M · full golden copy, ACTIVE only · enables offline, zero-latency LEI lookup
AI calls per 1 000 holders: −161 · 475 → 314 (−34 %) · resolved by deterministic steps instead
Updated Match Pipeline
Exact name → DBA / alias → LEI (GLEIF 3 M) ✦ new → Fuzzy (FTS + fuzzball) → AI (Ollama llama3.1)
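One way to read this pipeline is as an ordered cascade in which a later step runs only when every earlier step found nothing. The report does not state the exact control flow, so the stop-at-first-hit rule below is an assumption, as are all names. The matchers are trivial synchronous stubs, whereas a real Ollama call would be asynchronous.

```javascript
// Run the steps in order and return the rows of the first step that matches.
function runPipeline(holderName, steps) {
  for (const step of steps) {
    const rows = step.match(holderName);
    if (rows.length > 0) {
      // Tag each row with the method that produced it.
      return rows.map((row) => ({ ...row, method: step.method }));
    }
  }
  return []; // no-match
}

let aiCalls = 0; // counts how often the expensive last step is reached

// Stub matchers standing in for the real steps (hypothetical behaviour).
const steps = [
  { method: 'exact', match: (name) => (name === 'Acme Corp' ? [{ companyId: 1, confidence: 1.0 }] : []) },
  { method: 'dba', match: () => [] },
  { method: 'lei', match: (name) => (name.startsWith('Acme') ? [{ companyId: 1, confidence: 0.9 }] : []) },
  { method: 'fuzzy', match: () => [] },
  { method: 'ai', match: () => { aiCalls += 1; return []; } },
];

const viaLei = runPipeline('Acme Corp USA LLC', steps);
const unmatched = runPipeline('Unknown Holder', steps);

console.log(viaLei, unmatched, aiCalls);
```

Under this reading, every holder resolved by the LEI step is one fewer call to the AI step, which would be consistent with the drop in AI-resolved holders reported above.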
The LEI step performs a two-hop resolution:
holder name → GLEIF FTS → canonical LEI name → company index.
The similarity guard (≥ 65 % token_sort_ratio) prevents generic industry terms from
creating false bridges. Both match:tm and match:patents pipelines include this step.
When the local GLEIF table is empty the step is a near-zero-cost no-op.
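The two-hop resolution can be sketched with in-memory stand-ins. Everything here is hypothetical: the GLEIF FTS query is replaced by a shared-token filter, token_sort_ratio by a token-set Dice score, the company index by a `Map` keyed on normalised names, and the holder, company and LEI values are invented.

```javascript
const tokens = (s) => s.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
const normalise = (s) => tokens(s).join(' ');

// Stand-in for token_sort_ratio: Dice coefficient over token sets, 0..100.
function tokenDice(a, b) {
  const setA = new Set(tokens(a));
  const setB = new Set(tokens(b));
  if (setA.size + setB.size === 0) return 0;
  let shared = 0;
  for (const t of setA) if (setB.has(t)) shared += 1;
  return Math.round((200 * shared) / (setA.size + setB.size));
}

// Stand-in for the GLEIF FTS query: any record sharing at least one token.
function searchGleif(gleifRecords, query) {
  const wanted = new Set(tokens(query));
  return gleifRecords.filter((r) => tokens(r.legalName).some((t) => wanted.has(t)));
}

function matchViaLei(holderName, gleifRecords, companyIndex, minScore = 65) {
  if (gleifRecords.length === 0) return []; // empty GLEIF table: cheap no-op

  const matches = [];
  for (const record of searchGleif(gleifRecords, holderName)) { // hop 1
    const similarity = tokenDice(holderName, record.legalName);
    if (similarity < minScore) continue; // similarity guard
    const companyId = companyIndex.get(normalise(record.legalName)); // hop 2
    if (companyId !== undefined) {
      matches.push({ companyId, lei: record.lei, similarity, method: 'lei' });
    }
  }
  return matches;
}

const gleifRecords = [
  { lei: 'TESTLEI0000000000001', legalName: 'Acme Widgets Holdings Ltd' },
  { lei: 'TESTLEI0000000000002', legalName: 'Widgets Vape Shop Inc' },
];
const companyIndex = new Map([
  ['acme widgets holdings ltd', 'company-42'],
  ['widgets vape shop inc', 'company-7'],
]);

const leiMatches = matchViaLei('Acme Widgets Holdings', gleifRecords, companyIndex);
console.log(leiMatches);
```

In the demo both GLEIF records surface in hop 1 because they share the token "widgets", but only the first clears the guard, so the unrelated record never reaches the company index.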
Full-Pipeline Throughput Projection (3.5 M Holders)