WIMS Entity Mapper

GLEIF Golden Copy + Wikidata Integration: Results Report · 2026-05-31 · GLEIF 3.07 M records · Wikidata 272 companies · LEI similarity guard applied

Dry-Run Comparison — 1 000 Trademark Holders

Metric | Baseline (no enrichment) | After GLEIF + Wikidata
Holders matched | 559 / 1 000 (55.9 %) | 665 / 1 000 (66.5 %) ▲ +106 (+19 %)
No-match | 441 (44.1 %) | 335 (33.5 %) ▼ −106 (−24 %)
Via AI (of matched) | 475 (85 %) | 314 (47 %) ▼ −161 (−34 %)
Via LEI / GLEIF | 0 | 43 unique holders, avg conf 0.865
Processing speed | 17 / min | ~35 / min ▲ 2.1×
Time for 1 000 holders | 58.2 min | 28.9 min ▼ −29 min

* "After" figures are from a re-run with the LEI similarity guard (≥ 65 % token_sort_ratio) already applied. An intermediate unfiltered run showed 133 raw LEI hits with a 68 % false-positive rate; the guard reduced this to 56 match rows (43 unique holders), preserving precision.

Match Method Breakdown — Fixed Run

Bar width = share of total match rows (1 153 rows for 665 unique holders).

exact: 502 rows · avg 0.997
ai: 477 rows · avg 0.795 · 314 unique holders
fuzzy: 87 rows · avg 0.784
lei ✦: 56 rows · avg 0.865 · 43 unique holders
dba: 31 rows · avg 0.965

Baseline (no enrichment): exact 407 · ai 763 rows · fuzzy 79 · dba 25 · lei 0 — 559 matched, 85 % via AI.

LEI Matching — Quality Analysis & Fix

⚠ False Positives Detected — v1 LEI Matching

The initial LEI FTS bridge produced 133 match rows (that run matched 673 unique holders overall), but analysis revealed that 90 of them (68 %) were false positives. Generic industry tokens such as vape, electronic, and cigarettes caused the FTS to surface GLEIF records for unrelated companies that merely shared a keyword. Example: "180 Vape, LLC" was bridged to "Ziizai" at 14 % name similarity through a GLEIF entity that happened to contain "vape".

Fix applied — src/match/lei.js: a GLEIF record is only accepted as a bridge when token_sort_ratio(holder_name, gleif_legal_name) ≥ 65 %. This drops cross-industry noise while preserving legitimate subsidiary resolutions such as "Canopy Growth USA, LLC" → "Canopy Growth Corp." (93 %) and "NICOVENTURES RETAIL (UK) LIMITED" → "Nicoventures Holdings, Ltd." (71 %).
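The guard described above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/match/lei.js: tokenSortRatio here is a simplified, dependency-free stand-in for the token_sort_ratio scorer (normalised Levenshtein distance over sorted tokens), and acceptBridge, sortTokens and the normalisation rules are hypothetical. Scores from this stand-in will generally differ from the percentages quoted above, which come from the production scorer and whatever name preprocessing it applies.

```javascript
// Simplified stand-in for a token_sort_ratio scorer (assumption: the real
// pipeline uses a library scorer; this version only illustrates the idea).

// Lowercase, replace punctuation with spaces, then sort tokens so that
// word order does not affect the comparison.
function sortTokens(name) {
  return name
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, " ")
    .trim()
    .split(" ")
    .filter(Boolean)
    .sort()
    .join(" ");
}

// Classic Levenshtein edit distance using a single rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // value of cell (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value of cell (i-1, j)
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// Similarity in the range 0..100 between two names, ignoring token order.
function tokenSortRatio(a, b) {
  const s1 = sortTokens(a);
  const s2 = sortTokens(b);
  const longest = Math.max(s1.length, s2.length);
  if (longest === 0) return 0; // two empty names carry no evidence
  return Math.round(100 * (1 - levenshtein(s1, s2) / longest));
}

// Accept a GLEIF record as a bridge only above the similarity threshold.
function acceptBridge(holderName, gleifLegalName, threshold = 65) {
  return tokenSortRatio(holderName, gleifLegalName) >= threshold;
}

console.log(acceptBridge("180 Vape, LLC", "Ziizai")); // false
console.log(acceptBridge("Growth Canopy Corp", "Canopy Growth Corp.")); // true
```

Keeping the threshold as a parameter makes it straightforward to re-run the dry run at other cut-offs and compare false-positive rates.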

raw LEI matches (v1): 133 · 68 % false positive rate · no similarity guard

filtered LEI matches (fixed): 43 unique holders · ≥ 65 % name similarity required

LEI avg confidence: 0.865 · higher than AI (0.795) · deterministic, authoritative source

Data Enrichment Summary

GLEIF Golden Copy

Concatenated File XML · 2026-05-30 · 7.6 GB · streamed in 16 min

3,075,680 ACTIVE legal entities loaded
2,804,580 with registration numbers (91 %)
487,104 with alternative / transliterated names
FTS5 index with unicode61 tokeniser
Top countries: IN · US · DE · IT · GB · ES · NL · FR
Streaming XML parser — no extra npm dependencies

Wikidata Enrichment

wbsearchentities + wbgetentities API · 101.8 min · 3 workers

13,762 companies searched
272 Wikidata QID matches (2 % hit rate)
English skos:altLabels merged into company aliases
LEI codes (P1278) collected where available
Country codes resolved via P17 → P297 claim chain
Sentinel rows prevent re-querying on future runs
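As a rough illustration of the alias and country steps listed above, the sketch below builds the two API requests and reads aliases and the P17 → P297 chain out of a wbgetentities response. The helper names (searchUrl, entitiesUrl, firstClaimValue, extractEnrichment) and the sample entity are hypothetical; only the endpoint, the action names and the response shape follow the public Wikidata API. Worker scheduling, sentinel rows and LEI collection are left out.

```javascript
const API = "https://www.wikidata.org/w/api.php";

// Build a wbsearchentities request for a company name.
function searchUrl(companyName) {
  const params = new URLSearchParams({
    action: "wbsearchentities",
    search: companyName,
    language: "en",
    type: "item",
    format: "json",
  });
  return `${API}?${params}`;
}

// Build a wbgetentities request for one or more QIDs.
function entitiesUrl(qids) {
  const params = new URLSearchParams({
    action: "wbgetentities",
    ids: qids.join("|"),
    props: "aliases|claims",
    languages: "en",
    format: "json",
  });
  return `${API}?${params}`;
}

// First value of a claim, or null when the claim is missing or has no value.
function firstClaimValue(entity, property) {
  const snak = entity?.claims?.[property]?.[0]?.mainsnak;
  return snak?.snaktype === "value" ? snak.datavalue.value : null;
}

// Pull English aliases and resolve the country code via P17 -> P297.
// entitiesById must also contain the country entity referenced by P17.
function extractEnrichment(entity, entitiesById) {
  const aliases = (entity.aliases?.en ?? []).map((a) => a.value);
  const countryQid = firstClaimValue(entity, "P17")?.id ?? null;
  const countryEntity = countryQid ? entitiesById[countryQid] : null;
  const countryCode = countryEntity
    ? firstClaimValue(countryEntity, "P297")
    : null;
  return { qid: entity.id, aliases, countryCode };
}

// Made-up response fragment; "QEXAMPLE" is not a real QID.
const sampleEntities = {
  QEXAMPLE: {
    id: "QEXAMPLE",
    aliases: { en: [{ language: "en", value: "Example Co" }] },
    claims: {
      P17: [{ mainsnak: { snaktype: "value", datavalue: { value: { id: "Q30" } } } }],
    },
  },
  Q30: {
    id: "Q30",
    claims: {
      P297: [{ mainsnak: { snaktype: "value", datavalue: { value: "US" } } }],
    },
  },
};

const enriched = extractEnrichment(sampleEntities.QEXAMPLE, sampleEntities);
console.log(enriched); // { qid: 'QEXAMPLE', aliases: [ 'Example Co' ], countryCode: 'US' }
```

Resolving the country needs a second wbgetentities call for the country QIDs; batching those IDs keeps the number of requests small.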

unique tokens in company index: 12,938 · +312 vs baseline (12,626) · from Wikidata + GLEIF alias merging

companies enriched: 906 of 13,762 total companies (6.6 %) gained extra aliases

GLEIF records in local DB: 3.07 M · full golden copy, ACTIVE only · enables offline LEI lookup with no network round-trip

Holders resolved via AI per 1 000: 475 → 314 (−161, −34 %) · resolved by deterministic steps instead

Updated Match Pipeline

Exact name → DBA / alias → LEI (GLEIF 3 M) ✦ new → Fuzzy (FTS + fuzzball) → AI (Ollama llama3.1)

The LEI step performs a two-hop resolution: holder name → GLEIF FTS → canonical LEI name → company index. The similarity guard (≥ 65 % token_sort_ratio) prevents generic industry terms from creating false bridges. Both match:tm and match:patents pipelines include this step. When the local GLEIF table is empty the step is a near-zero-cost no-op.
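The cascade and the two-hop LEI step can be sketched as below. Everything here is illustrative: the in-memory Maps stand in for the company index, the token-overlap filter stands in for the GLEIF FTS lookup, similarity is a crude token-overlap score standing in for token_sort_ratio, and the fuzzy and AI steps are injected stubs. None of these names come from the actual pipeline.

```javascript
// Normalise a name and split it into tokens.
const norm = (s) => s.toLowerCase().replace(/[^\p{L}\p{N}]+/gu, " ").trim();
const tokens = (s) => norm(s).split(" ").filter(Boolean);

// Crude stand-in for token_sort_ratio: shared tokens / all tokens, as 0..100.
function similarity(a, b) {
  const ta = new Set(tokens(a));
  const tb = new Set(tokens(b));
  const shared = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : Math.round((100 * shared) / union);
}

// Build a matcher that tries each step in order; the first hit wins.
function makeMatcher({ companies, aliases, gleif, fuzzy, ai }) {
  const leiStep = (holder) => {
    if (gleif.length === 0) return null; // empty GLEIF table: no-op
    const holderTokens = new Set(tokens(holder));
    // Hop 1 (FTS stand-in): any record sharing a token is a candidate.
    const candidates = gleif.filter((r) =>
      tokens(r.legalName).some((t) => holderTokens.has(t))
    );
    // Guard: keep only a candidate whose legal name resembles the holder.
    const bridge = candidates.find((r) => similarity(holder, r.legalName) >= 65);
    // Hop 2: canonical legal name -> company index.
    return bridge ? (companies.get(norm(bridge.legalName)) ?? null) : null;
  };

  const steps = [
    ["exact", (h) => companies.get(norm(h)) ?? null],
    ["dba", (h) => aliases.get(norm(h)) ?? null],
    ["lei", leiStep],
    ["fuzzy", fuzzy],
    ["ai", ai],
  ];

  return (holder) => {
    for (const [method, step] of steps) {
      const companyId = step(holder);
      if (companyId != null) return { method, companyId };
    }
    return { method: "no-match", companyId: null };
  };
}

// Usage with made-up data.
const match = makeMatcher({
  companies: new Map([["acme growth corp", 1], ["vape world inc", 2]]),
  aliases: new Map([["acme", 1]]),
  gleif: [
    { lei: "HYPOTHETICAL-LEI-1", legalName: "Acme Growth Corp" },
    { lei: "HYPOTHETICAL-LEI-2", legalName: "Vape World Inc" },
  ],
  fuzzy: () => null, // stub
  ai: () => null, // stub
});

console.log(match("Acme Growth Corp USA")); // { method: 'lei', companyId: 1 }
console.log(match("180 Vape LLC")); // { method: 'no-match', companyId: null }
```

In the second call the shared token "vape" makes "Vape World Inc" a candidate, but the guard rejects it, which is the false-bridge case the similarity check is meant to stop.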

Full-Pipeline Throughput Projection (3.5 M Holders)

Before (baseline): ~143 days · at 17 holders / min · 85 % of matched holders resolved via AI, the bottleneck step

After (GLEIF + Wikidata): ~69 days · at ~35 holders / min · 47 % via AI · 43 LEI per 1 000 · 2.1× faster

GPU time saved: ~74 days (143 − 69) · at same throughput · proportional AI cost reduction

Match rate improvement: 55.9 % → 66.5 % · +10.6 pp (+19 %) · extrapolated to full holder set
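The projection figures follow from simple arithmetic on the dry-run speeds. The sketch below reproduces them, assuming a continuous single-stream run at the rounded speeds of 17 and 35 holders per minute; projectDays is a hypothetical helper name.

```javascript
// Days needed to process `totalHolders` at a constant rate.
function projectDays(totalHolders, holdersPerMinute) {
  const minutesPerDay = 60 * 24;
  return totalHolders / holdersPerMinute / minutesPerDay;
}

const TOTAL_HOLDERS = 3_500_000;
const daysBefore = projectDays(TOTAL_HOLDERS, 17);
const daysAfter = projectDays(TOTAL_HOLDERS, 35);

console.log(daysBefore.toFixed(1)); // 143.0
console.log(daysAfter.toFixed(1)); // 69.4
console.log((daysBefore - daysAfter).toFixed(1)); // 73.5

// Relative match-rate improvement from the dry run: (665 - 559) / 559.
console.log((((665 - 559) / 559) * 100).toFixed(1)); // 19.0
```

The unrounded saving is about 73.5 days; the ~74 days quoted above is the difference of the two rounded projections (143 − 69).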