WIMS Entity Mapper — Cloud & Data Enrichment Summary

Generated: · Based on dry-run benchmark: 1,000 trademark holders, AI fallback enabled

Dry-Run Benchmark Results (1,000 holders, RTX 4060)

559

Total matched

475

Matched via AI (85%)

84

Exact / Fuzzy only

441

No match found

58.2 min

Total elapsed

17.2

Holders / min

The AI fallback (Ollama llama3.1 8B Q4_K_M) is responsible for 85% of all successful matches, confirming it is essential — not supplementary — to the pipeline. Without AI, the pipeline would have matched only 84 holders (8.4%) instead of 559 (55.9%).

The 441 no-match holders (44%) represent the ceiling for improvement. Most either have no name-token overlap with any company in the index (FTS returns nothing, so AI is never invoked), or the names are too semantically distant for current models. External data sources are the primary lever to shrink this group.
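The short-circuit described above can be sketched as follows. This is a minimal Python illustration (the pipeline itself is described as Node.js), assuming a simple token-overlap step stands in for FTS and omitting the fuzzy stage; `tokens`, `candidates`, `match`, and the `ai_pick` callback are hypothetical names, not the pipeline's own.

```python
import re
from typing import Callable, Optional


def tokens(name: str) -> set[str]:
    """Lowercased word tokens; a rough stand-in for the FTS tokenizer (assumption)."""
    return set(re.findall(r"\w+", name.lower()))


def candidates(holder: str, index: list[str]) -> list[str]:
    """Indexed companies sharing at least one token with the holder name."""
    holder_tokens = tokens(holder)
    return [company for company in index if holder_tokens & tokens(company)]


def match(
    holder: str,
    index: list[str],
    ai_pick: Callable[[str, list[str]], Optional[str]],
) -> tuple[Optional[str], str]:
    """Return (matched company or None, method used)."""
    for company in index:
        if company.lower() == holder.lower():
            return company, "exact"
    shortlist = candidates(holder, index)
    if not shortlist:
        # No token overlap: there is nothing to show the model, so it is never called.
        return None, "no_match"
    choice = ai_pick(holder, shortlist)
    return (choice, "ai") if choice else (None, "no_match")
```

With an index of "Leiyan Technology Co Ltd" and "Acme Corp", the holder "雷岩" produces an empty shortlist and no AI call, while "ACME Corporation" shares the token "acme" and reaches `ai_pick`.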

Remaining Workload (Local DB)

3.51M

Total holders in DB

140,916

Remaining to process

199,732

Already matched

~137 hrs

Est. time (current GPU)

Google Cloud — CPU-Only Deployment

Running Ollama on CPU eliminates the GPU requirement at the cost of significantly lower token generation throughput. The model (llama3.1 8B Q4_K_M, ~4.5 GB) fits in system RAM on any instance with ≥ 8 GB, but inference speed is bottlenecked by the memory bus rather than compute.

Performance Comparison

Hardware	Tokens/s	Slowdown vs RTX 4060	Rate (holders/min)	Est. time for 140,916 holders
RTX 4060 (current)	~55	1×	17.2	~137 h
n2-standard-8 (CPU, AVX2)	~6–8	~8–9×	~2.0	~1,174 h
n2-standard-4 (CPU, AVX2)	~4–6	~10–12×	~1.5	~1,566 h
Older CPU (no AVX2)	< 2	> 25×	< 0.7	> 3,300 h

CPU-only processing on a single n2-standard-8 extends the 140K-holder job from ~6 days to ~49 days (~65 days on n2-standard-4). Parallelising across multiple VMs reduces elapsed time proportionally but does not reduce total cost, because the total number of VM-hours is unchanged.
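The duration estimates follow directly from the measured holders-per-minute rates. A small sketch of that arithmetic, using only figures from the table above (the function names are illustrative):

```python
REMAINING_HOLDERS = 140_916

# Holders per minute, taken from the performance table.
RATES = {
    "RTX 4060": 17.2,
    "n2-standard-8": 2.0,
    "n2-standard-4": 1.5,
}


def estimate_hours(holders: int, holders_per_min: float) -> float:
    """Total processing time in hours at a constant rate."""
    return holders / holders_per_min / 60


def wall_clock_hours(total_hours: float, n_vms: int) -> float:
    """Elapsed time with n identical VMs; total VM-hours (and cost) stay the same."""
    return total_hours / n_vms


for name, rate in RATES.items():
    hours = estimate_hours(REMAINING_HOLDERS, rate)
    print(f"{name}: ~{hours:,.0f} h (~{hours / 24:.0f} days)")
```

Splitting the n2-standard-8 job across four VMs, for example, would bring elapsed time to roughly a quarter of the single-instance figure while still consuming the same number of billed hours.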

Cost Estimate — CPU-Only (140,916 holders)

Instance	vCPU	RAM	On-demand /hr	Preemptible /hr	Duration	On-demand total	Preemptible total
`n2-standard-4`	4	16 GB	$0.19	$0.05	~1,566 h	$297	$78
`n2-standard-8`	8	32 GB	$0.39	$0.10	~1,174 h	$457	$117
`c2-standard-8` (compute-opt.)	8	32 GB	$0.42	$0.11	~1,000 h	$420	$110

Cost Estimate — GPU Instances (for comparison)

Instance	GPU	VRAM	On-demand /hr	Spot /hr	Duration	On-demand total	Spot total
`n1-standard-4` + T4	T4	16 GB	$0.35	$0.10	~118 h	$41	$12
`g2-standard-4` + L4	L4	24 GB	$0.70	$0.18	~78 h	$55	$14
`a2-highgpu-1g` + A100	A100	40 GB	$2.93	$0.88	~35 h	$103	$31

Key insight: A preemptible n1-standard-4 + T4 GPU instance costs ~$12 and finishes in about 5 days, versus $78–117 and roughly 42–65 days for the preemptible CPU-only options above. The GPU route is several times cheaper and about 10× faster. CPU-only is justified only when GPU quota is unavailable.
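The totals in both cost tables are simply hourly price multiplied by estimated duration. A minimal sketch of the comparison, assuming the preemptible/spot prices and durations listed above (the `Option` class is illustrative only):

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    hourly_usd: float  # preemptible / spot price from the tables above
    hours: float       # estimated duration for the 140,916-holder backlog

    @property
    def total_usd(self) -> float:
        return self.hourly_usd * self.hours


OPTIONS = [
    Option("n2-standard-4 (CPU)", 0.05, 1566),
    Option("n2-standard-8 (CPU)", 0.10, 1174),
    Option("n1-standard-4 + T4", 0.10, 118),
]

cheapest = min(OPTIONS, key=lambda o: o.total_usd)

for option in OPTIONS:
    print(f"{option.name}: ${option.total_usd:,.0f} over {option.hours / 24:.0f} days")
print(f"Cheapest: {cheapest.name}")
```

The T4 and n2-standard-8 rows share the same hourly price, so the cost gap comes entirely from the difference in duration.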

Practical CPU Recommendation

If GPU instances are unavailable, use n2-standard-8 preemptible with checkpointing (the pipeline already supports resume via MAX(source_id)). Run multiple instances across ID-range shards to parallelise. Expect ~$117 total for the current 140K backlog (~1,174 h at $0.10/hr), or ~$800–1,200 if the full API stream adds another ~1M holders.
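One way the ID-range sharding and MAX(source_id) resume could fit together is sketched below. It is written in Python with the standard `sqlite3` module for illustration; the table name `match_results` and both function names are assumptions, and it presumes each shard writes its results in ascending `source_id` order.

```python
import sqlite3


def shard_ranges(min_id: int, max_id: int, n_shards: int) -> list[tuple[int, int]]:
    """Split [min_id, max_id] into contiguous inclusive ID ranges, one per VM."""
    total = max_id - min_id + 1
    size = -(-total // n_shards)  # ceiling division
    return [
        (lo, min(lo + size - 1, max_id))
        for lo in range(min_id, max_id + 1, size)
    ]


def resume_point(conn: sqlite3.Connection, lo: int, hi: int) -> int:
    """Next source_id to process within shard [lo, hi] after a preemption."""
    row = conn.execute(
        # match_results is a hypothetical table name for already-processed holders.
        "SELECT MAX(source_id) FROM match_results WHERE source_id BETWEEN ? AND ?",
        (lo, hi),
    ).fetchone()
    return lo if row[0] is None else row[0] + 1
```

A preempted instance restarted on its shard would call `resume_point` once and continue from the returned ID, so at most the holder in flight at preemption time is reprocessed.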

Minimum viable instance: n2-standard-4 (4 vCPU, 16 GB). The model occupies about 4.5–5 GB in memory, leaving roughly 11 GB, which comfortably handles the OS, SQLite DB, and Node.js pipeline. AVX2 is needed for usable throughput (CPUs without it fall below 2 tokens/s); all n2 instances support it.

Public Data Sources to Reduce Uncertain AI Matches

Most of the 441 no-match holders (44% of the dry run) never reach the AI step because FTS returns no candidates: the name shares no token with any indexed company. External registries can bridge this gap by adding officially registered alternate names, transliterations, and cross-reference identifiers directly to the company index.

Tier 1 — Highest Impact, Free Bulk Download

GLEIF Golden Copy (LEI Register)

gleif.org · Free · Daily bulk download · ~2.2M entities

The authoritative global registry for Legal Entity Identifiers. Each record contains the official legal name, a transliterated name in Roman script (covers CJK, Cyrillic, Arabic), alternative business names, registration number, country, and LEI code.

Why it matters: Directly resolves Chinese transliteration mismatches (e.g. "雷岩" → "Leiyan") that caused DBA false negatives in the benchmark. LEI can be stored as a cross-reference key enabling definitive matching at 0.98 confidence, bypassing fuzzy and AI entirely.

Free · Bulk CSV · Transliterations · Reg. numbers
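A GLEIF ingest could look roughly like the sketch below: read the bulk CSV, register every legal, alternative, and transliterated name under its LEI, and return a fixed-confidence `lei` match on lookup. The column names are assumptions about the Golden Copy CSV layout and should be checked against the real header row; the LEI in the sample is a placeholder, and the function names are hypothetical.

```python
import csv
import io
from typing import Optional, TextIO

LEI_CONFIDENCE = 0.98  # confidence assigned to LEI-backed matches, per the text above

# Assumed Golden Copy column names; verify against the downloaded file's header.
NAME_COLUMNS = (
    "Entity.LegalName",
    "Entity.OtherEntityNames.OtherEntityName.1",
    "Entity.TransliteratedOtherEntityNames.TransliteratedOtherEntityName.1",
)


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so lookups ignore trivial differences."""
    return " ".join(name.casefold().split())


def build_lei_index(csv_file: TextIO) -> dict[str, str]:
    """Map every known name variant to its LEI (first occurrence wins)."""
    index: dict[str, str] = {}
    for row in csv.DictReader(csv_file):
        for column in NAME_COLUMNS:
            name = (row.get(column) or "").strip()
            if name:
                index.setdefault(normalize(name), row["LEI"])
    return index


def match_by_lei(holder: str, index: dict[str, str]) -> Optional[dict]:
    """Definitive match when the holder name is a registered variant; else None."""
    lei = index.get(normalize(holder))
    if lei is None:
        return None
    return {"lei": lei, "method": "lei", "confidence": LEI_CONFIDENCE}


# Tiny in-memory sample; the LEI value is a placeholder, not a real identifier.
SAMPLE = io.StringIO(
    "LEI,Entity.LegalName,"
    "Entity.TransliteratedOtherEntityNames.TransliteratedOtherEntityName.1\n"
    "PLACEHOLDERLEI000001,雷岩,Leiyan\n"
)
index = build_lei_index(SAMPLE)
```

Because both "雷岩" and "Leiyan" resolve to the same LEI, a holder filed under either spelling would match without a fuzzy score or an AI call.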

Wikidata

wikidata.org · Free · SPARQL endpoint + dumps

Structured knowledge base with ~1M company entities. Properties include official name (P1448), aliases in 50+ languages, industry (P452), country (P17), official website, and links to national registries.

Why it matters: Strong coverage of brands, multinationals, and publicly-traded companies — exactly the entities that appear frequently as trademark holders. Alias lists resolve abbreviations and brand-name vs legal-name discrepancies.

Free · SPARQL / dump · Multilingual aliases
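A batch alias pull from Wikidata might start from a query like the one below. This is a sketch only: it assumes companies are modelled as direct instances of "business" (Q4830453), which will miss entities typed under subclasses, and the function names are illustrative. Only the query and request URL are built here; sending the request and paging through results are left out.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def alias_query(country_qid: str, limit: int = 100) -> str:
    """SPARQL for businesses in one country with official names and aliases."""
    return f"""
SELECT ?company ?companyLabel ?officialName ?alias WHERE {{
  ?company wdt:P31 wd:Q4830453 ;
           wdt:P17 wd:{country_qid} .
  OPTIONAL {{ ?company wdt:P1448 ?officialName . }}
  OPTIONAL {{ ?company skos:altLabel ?alias . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


def request_url(query: str) -> str:
    """GET URL for the query service, asking for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"
```

Calling `request_url(alias_query("Q183"))` would produce a URL for German businesses; each returned alias could then be added to the alias index under the `wikidata` match method.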

OpenCorporates

opencorporates.com · Free (non-commercial) / API · 200M+ registrations

Largest open database of company registrations, covering 140 jurisdictions. Includes official company name, registration number, jurisdiction, status, officers, and — crucially — alternative/trading names (DBAs) per filing.

Why it matters: DBA data is the primary gap in the current alias index. OpenCorporates bulk data would dramatically expand the aliases column with officially-registered trading names, reducing AI calls for DBA pairs.

Non-commercial free · Bulk + API · DBAs / trading names

Tier 2 — Domain-Specific (Trademark & Patent)

WIPO Global Brand Database

branddb.wipo.int · Free · Bulk via WIPO IP Portal

Trademark holder names normalized across 70+ national offices. Because WIPO already deduplicates holder names across filings, this is a pre-solved version of the exact problem the pipeline is tackling — holder name → legal entity.

Free · Trademark-specific · Deduped holders

USPTO TSDR

tsdr.uspto.gov · Free API · US trademarks

Full US trademark holder names with addresses and correspondent data. Cross-filing analysis links holders across applications, revealing parent/subsidiary relationships not visible in single-filing data.

Free API · US-only

EPO PATSTAT

epo.org · Academic / commercial · Full patent applicant DB

Patent applicant names with EPO-standardized deduplication (for example the doc_std_name, psn_name, and han_name columns stored alongside person_name in the TLS206_PERSON table). The EPO has already resolved many of the same name-variant ambiguities present in the patent applicant pipeline. PATSTAT Online offers SQL access; bulk download is available commercially.

Commercial · Patent-specific · Pre-deduped

Tier 3 — National Registries (Jurisdiction-Specific)

Registry	Jurisdiction	Access	Key data
Companies House	UK	Free API	Official name, reg. number, previous names, SIC codes
SEC EDGAR	US	Free bulk	Legal entity name, CIK, SIC, state of incorporation
Sirene (INSEE)	France	Free bulk	SIRET/SIREN, legal name, NAF code
Handelsregister	Germany	Limited API	HRB number, legal name, registered seat
ARCA / Rol	Romania	Free	CUI, legal name, status

Recommended Integration Order

#	Source	Effort	Expected AI call reduction	New match method
1	GLEIF	Low — one CSV ingest	High — fixes CJK transliterations + adds LEI key	`lei`
2	Wikidata	Low — batch SPARQL	Medium — resolves well-known brands & abbreviations	`wikidata`
3	OpenCorporates	Medium — license + ETL	High — DBA coverage across 140 jurisdictions	`dba` (enriched)
4	WIPO Brand DB	Medium	High for trademark-holder pipeline specifically	`wipo`
5	PATSTAT	High — large dataset, commercial	High for patent-applicant pipeline specifically	`patstat`

GLEIF is the single highest-ROI first step: one bulk CSV download (~1 GB compressed), a single ingest pipeline, and the LEI cross-reference immediately enables definitive matching at 0.98 confidence for any entity found — no fuzzy scoring or AI call needed. It also adds transliterated names to the alias index, directly addressing the Chinese DBA false-negative problem observed in the correctness benchmark.