WIMS Entity Mapper — Cloud & Data Enrichment Summary

Generated:  ·  Based on dry-run benchmark: 1,000 trademark holders, AI fallback enabled

Dry-Run Benchmark Results (1,000 holders, RTX 4060)

Total matched: 559
Matched via AI (85%): 475
Exact / Fuzzy only: 84
No match found: 441
Total elapsed: 58.2 min
Holders / min: 17.2

The AI fallback (Ollama llama3.1 8B Q4_K_M) is responsible for 85% of all successful matches, confirming it is essential — not supplementary — to the pipeline. Without AI, the pipeline would have matched only 84 holders (8.4%) instead of 559 (55.9%).

The 441 no-match holders (44%) are the remaining headroom for improvement. Most either have no name-token overlap with any company in the index (FTS returns nothing, so AI is never invoked), or have names too semantically distant for the current model to link. External data sources are the primary lever to shrink this group.
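The percentages quoted above follow directly from the four benchmark counts. A short Python check, with variable names chosen here for illustration rather than taken from the pipeline:

```python
# Counts from the 1,000-holder dry run reported above.
total_holders = 1000
matched_total = 559
matched_ai = 475
matched_exact_fuzzy = 84
no_match = 441

# The groups partition the sample.
assert matched_ai + matched_exact_fuzzy == matched_total
assert matched_total + no_match == total_holders

ai_share_of_matches = matched_ai / matched_total             # matches that needed AI
match_rate_with_ai = matched_total / total_holders           # overall match rate
match_rate_without_ai = matched_exact_fuzzy / total_holders  # exact/fuzzy alone
no_match_rate = no_match / total_holders

print(f"AI share of matches:    {ai_share_of_matches:.1%}")
print(f"Match rate with AI:     {match_rate_with_ai:.1%}")
print(f"Match rate without AI:  {match_rate_without_ai:.1%}")
print(f"No-match rate:          {no_match_rate:.1%}")
```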

Remaining Workload (Local DB)

Total holders in DB: 3.51M
Remaining to process: 140,916
Already matched: 199,732
Est. time (current GPU): ~137 hrs

Google Cloud — CPU-Only Deployment

Running Ollama on CPU eliminates the GPU requirement at the cost of roughly 8–12× lower token generation throughput on AVX2 CPUs (see the comparison below). The model (llama3.1 8B Q4_K_M, ~4.5 GB) fits in system RAM on any instance with ≥ 8 GB, but inference speed is bottlenecked by memory bandwidth rather than compute.

Performance Comparison

| Hardware | Tokens/s | vs RTX 4060 | Rate (holders/min) | Est. for 140K holders |
|---|---|---|---|---|
| RTX 4060 (current) | ~55 | baseline | 17.2 | ~137 h |
| n2-standard-8 (CPU, AVX2) | ~6–8 | ~8–9× | ~2.0 | ~1,174 h |
| n2-standard-4 (CPU, AVX2) | ~4–6 | ~10–12× | ~1.5 | ~1,566 h |
| Older CPU (no AVX2) | < 2 | > 25× | < 0.7 | > 3,300 h |

CPU-only processing extends the 140K-holder job from ~6 days to ~49 days on a single instance. Parallelising across multiple VMs reduces elapsed time proportionally but does not reduce total cost.
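The duration estimates come from hours = holders / (holders per minute) / 60. The sketch below reproduces them and illustrates why adding VMs shortens elapsed time without changing the billed VM-hours. It assumes ideal linear scaling and ignores startup and preemption overhead; the function names are illustrative.

```python
BACKLOG = 140_916  # holders remaining, from the workload summary

# Approximate throughput (holders/min) from the performance comparison.
RATES = {
    "RTX 4060": 17.2,
    "n2-standard-8": 2.0,
    "n2-standard-4": 1.5,
}

def job_hours(holders: int, holders_per_min: float) -> float:
    """Single-instance wall-clock hours to process `holders`."""
    return holders / holders_per_min / 60

def parallel_plan(holders: int, holders_per_min: float, vms: int) -> tuple[float, float]:
    """Return (elapsed_hours, total_vm_hours), assuming ideal linear scaling."""
    total_vm_hours = job_hours(holders, holders_per_min)
    return total_vm_hours / vms, total_vm_hours

for name, rate in RATES.items():
    hours = job_hours(BACKLOG, rate)
    print(f"{name:14s} {hours:7.0f} h  ({hours / 24:.1f} days)")

elapsed, billed = parallel_plan(BACKLOG, RATES["n2-standard-8"], vms=8)
print(f"8 x n2-standard-8: {elapsed:.0f} h elapsed, {billed:.0f} VM-hours billed")
```

Because the billed VM-hours stay constant, the shard count is purely a deadline decision under these assumptions.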

Cost Estimate — CPU-Only (140,916 holders)

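| Instance | vCPU | RAM | On-demand /hr | Preemptible /hr | Duration | On-demand total | Preemptible total |
|---|---|---|---|---|---|---|---|
| n2-standard-4 | 4 | 16 GB | $0.19 | $0.05 | ~1,566 h | $297 | $78 |
| n2-standard-8 | 8 | 32 GB | $0.39 | $0.10 | ~1,174 h | $457 | $117 |
| c2-standard-8 (compute-opt.) | 8 | 32 GB | $0.42 | $0.11 | ~1,000 h | $420 | $110 |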

Cost Estimate — GPU Instances (for comparison)

| Instance | GPU | VRAM | On-demand /hr | Spot /hr | Duration | On-demand total | Spot total |
|---|---|---|---|---|---|---|---|
| n1-standard-4 + T4 | T4 | 16 GB | $0.35 | $0.10 | ~118 h | $41 | $12 |
| g2-standard-4 + L4 | L4 | 24 GB | $0.70 | $0.18 | ~78 h | $55 | $14 |
| a2-highgpu-1g + A100 | A100 | 40 GB | $2.93 | $0.88 | ~35 h | $103 | $31 |

Key insight: A spot (preemptible) n1-standard-4 + T4 GPU instance costs ~$12 and finishes in about 5 days, versus $78–117 and 49–65 days for CPU-only n2 instances. GPU is both cheaper and roughly 10× faster. CPU-only is justified only when GPU quota is unavailable.
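Each total in the two cost tables is the hourly rate multiplied by the estimated duration. The sketch below recomputes the preemptible/spot totals and ranks the options. The prices are the approximate figures quoted in the tables, not live pricing, and the `Option` class is purely illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Option:
    name: str
    spot_per_hour: float   # USD/hr, approximate figure from the tables above
    duration_hours: float  # estimated hours for the 140,916-holder backlog

    @property
    def spot_total(self) -> float:
        return self.spot_per_hour * self.duration_hours

    @property
    def days(self) -> float:
        return self.duration_hours / 24

OPTIONS = [
    Option("n2-standard-4 (CPU)", 0.05, 1566),
    Option("n2-standard-8 (CPU)", 0.10, 1174),
    Option("c2-standard-8 (CPU)", 0.11, 1000),
    Option("n1-standard-4 + T4", 0.10, 118),
    Option("g2-standard-4 + L4", 0.18, 78),
    Option("a2-highgpu-1g + A100", 0.88, 35),
]

# Cheapest first.
for opt in sorted(OPTIONS, key=lambda o: o.spot_total):
    print(f"{opt.name:22s} ${opt.spot_total:6.0f}  {opt.days:5.1f} days")

cheapest = min(OPTIONS, key=lambda o: o.spot_total)
```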

Practical CPU Recommendation

If GPU instances are unavailable, use preemptible n2-standard-8 instances with checkpointing (the pipeline already supports resume via MAX(source_id)). Run multiple instances across ID-range shards to parallelise. Expect ~$117 total for the current 140K backlog (per the preemptible column above), or ~$800–1,200 if the full API stream adds another ~1M holders.
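One way the ID-range sharding and MAX(source_id) resume could fit together is sketched below. The `matches` table, its columns, and both helper functions are hypothetical. The sketch also assumes each worker processes its shard in ascending ID order and records every processed holder, which the source does not confirm.

```python
import sqlite3

def shard_ranges(min_id: int, max_id: int, shards: int) -> list[tuple[int, int]]:
    """Split the inclusive ID range [min_id, max_id] into at most `shards` contiguous ranges."""
    span = max_id - min_id + 1
    size = -(-span // shards)  # ceiling division
    return [
        (lo, min(lo + size - 1, max_id))
        for lo in range(min_id, max_id + 1, size)
    ]

def resume_point(conn: sqlite3.Connection, lo: int, hi: int) -> int:
    """First unprocessed ID in a shard: one past the MAX(source_id) already stored."""
    (last,) = conn.execute(
        "SELECT MAX(source_id) FROM matches WHERE source_id BETWEEN ? AND ?",
        (lo, hi),
    ).fetchone()
    return lo if last is None else last + 1

# Demo with an in-memory database and a hypothetical results table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (source_id INTEGER PRIMARY KEY, company_id INTEGER)")
# Pretend a preempted worker had finished IDs 1..40 of the first shard.
conn.executemany("INSERT INTO matches VALUES (?, NULL)", [(i,) for i in range(1, 41)])

ranges = shard_ranges(1, 300, shards=3)
starts = [resume_point(conn, lo, hi) for lo, hi in ranges]
print(ranges)
print(starts)
```

Scoping the MAX query to the shard's range lets a restarted preemptible VM pick up its own shard without coordinating with the others.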

Minimum viable instance: n2-standard-4 (4 vCPU, 16 GB). The model occupies ~4.5 GB; the remaining ~11 GB comfortably handles the OS, SQLite DB, and Node.js pipeline. AVX2 is needed for usable throughput, and all n2 instances support it.

Public Data Sources to Reduce Uncertain AI Matches

Most of the 441 no-match holders (44% of the dry run) cannot be resolved by AI because FTS returns no candidates: the name shares no token with any indexed company. External registries can bridge this gap by adding officially registered alternate names, transliterations, and cross-reference identifiers directly to the company index.
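The mechanism can be illustrated with a toy inverted index. This is a simplified pure-Python stand-in for the pipeline's FTS lookup, not its actual implementation, and the company and holder names are made up. Without a shared token there are no candidates to hand to the AI stage; after adding one transliterated alias, the same holder retrieves the company.

```python
import re
from collections import defaultdict

def tokens(name: str) -> set[str]:
    """Lowercased word tokens; a run of CJK characters stays together as one token."""
    return set(re.findall(r"\w+", name.lower()))

class CandidateIndex:
    """Toy inverted index: token -> company IDs (a stand-in for full-text search)."""

    def __init__(self) -> None:
        self._by_token: dict[str, set[int]] = defaultdict(set)

    def add(self, company_id: int, *names: str) -> None:
        for name in names:
            for tok in tokens(name):
                self._by_token[tok].add(company_id)

    def candidates(self, holder_name: str) -> set[int]:
        found: set[int] = set()
        for tok in tokens(holder_name):
            found |= self._by_token.get(tok, set())
        return found

index = CandidateIndex()
index.add(1, "雷岩科技有限公司")            # hypothetical company, native-script legal name

holder = "Leiyan Technology Co., Ltd."     # hypothetical holder name as filed
before = index.candidates(holder)          # no shared token, so nothing reaches the AI stage

# Enrichment: add a (hypothetical) transliterated alias for company 1.
index.add(1, "Leiyan Keji Youxian Gongsi")
after = index.candidates(holder)           # now shares the token "leiyan"

print(before, after)
```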

Tier 1 — Highest Impact, Free Bulk Download

GLEIF Golden Copy (LEI Register)

gleif.org · Free · Daily bulk download · ~2.2M entities

The authoritative global registry for Legal Entity Identifiers. Each record contains the official legal name, a transliterated name in Roman script (covers CJK, Cyrillic, Arabic), alternative business names, registration number, country, and LEI code.

Why it matters: Directly resolves Chinese transliteration mismatches (e.g. "雷岩" → "Leiyan") that caused DBA false negatives in the benchmark. LEI can be stored as a cross-reference key enabling definitive matching at 0.98 confidence, bypassing fuzzy and AI entirely.

Free · Bulk CSV · Transliterations · Reg. numbers

Wikidata

wikidata.org · Free · SPARQL endpoint + dumps

Structured knowledge base with ~1M company entities. Properties include official name (P1448), aliases in 50+ languages, industry (P452), country (P17), official website, and links to national registries.

Why it matters: Strong coverage of brands, multinationals, and publicly-traded companies — exactly the entities that appear frequently as trademark holders. Alias lists resolve abbreviations and brand-name vs legal-name discrepancies.

Free · SPARQL / dump · Multilingual aliases
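A batch alias pull from the SPARQL endpoint might start from a query builder like the one below. It is a sketch only: it builds the query text and request URL without sending anything. The class filter (instances of "business", Q4830453), the language list, and the paging values are assumptions to tune. A broad class traversal like this may time out on the public endpoint, in which case the dumps are the fallback.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

def alias_query(languages: list[str], limit: int, offset: int = 0) -> str:
    """SPARQL for company items with official name (P1448), country (P17) and aliases."""
    langs = ", ".join(f'"{code}"' for code in languages)
    return f"""
SELECT ?company ?officialName ?country ?alias WHERE {{
  ?company wdt:P31/wdt:P279* wd:Q4830453 .   # assumed class filter: business
  OPTIONAL {{ ?company wdt:P1448 ?officialName . }}
  OPTIONAL {{ ?company wdt:P17 ?country . }}
  ?company skos:altLabel ?alias .
  FILTER(LANG(?alias) IN ({langs}))
}}
LIMIT {limit} OFFSET {offset}
""".strip()

def request_url(query: str) -> str:
    """GET URL asking the endpoint for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"

query = alias_query(["en", "zh"], limit=500, offset=1000)
url = request_url(query)
print(query)
```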

OpenCorporates

opencorporates.com · Free (non-commercial) / API · 200M+ registrations

Largest open database of company registrations, covering 140 jurisdictions. Includes official company name, registration number, jurisdiction, status, officers, and — crucially — alternative/trading names (DBAs) per filing.

Why it matters: DBA data is the primary gap in the current alias index. OpenCorporates bulk data would dramatically expand the aliases column with officially-registered trading names, reducing AI calls for DBA pairs.

Non-commercial free · Bulk + API · DBAs / trading names

Tier 2 — Domain-Specific (Trademark & Patent)

WIPO Global Brand Database

branddb.wipo.int · Free · Bulk via WIPO IP Portal

Trademark holder names normalized across 70+ national offices. Because WIPO already deduplicates holder names across filings, this is a pre-solved version of the exact problem the pipeline is tackling — holder name → legal entity.

Free · Trademark-specific · Deduped holders

USPTO TSDR

tsdr.uspto.gov · Free API · US trademarks

Full US trademark holder names with addresses and correspondent data. Cross-filing analysis links holders across applications, revealing parent/subsidiary relationships not visible in single-filing data.

Free API · US-only

EPO PATSTAT

epo.org · Academic / commercial · Full patent applicant DB

Patent applicant names with EPO-standardized deduplication (person_name in Docdb tables). The EPO has already resolved many of the same name-variant ambiguities present in the patent applicant pipeline. PATSTAT Online offers SQL access; bulk download is available commercially.

Commercial · Patent-specific · Pre-deduped

Tier 3 — National Registries (Jurisdiction-Specific)

| Registry | Jurisdiction | Access | Key data |
|---|---|---|---|
| Companies House | UK | Free API | Official name, reg. number, previous names, SIC codes |
| SEC EDGAR | US | Free bulk | Legal entity name, CIK, SIC, state of incorporation |
| Sirene (INSEE) | France | Free bulk | SIRET/SIREN, legal name, NAF code |
| Handelsregister | Germany | Limited API | HRB number, legal name, registered seat |
| ARCA / Rol | Romania | Free | CUI, legal name, status |

Recommended Integration Order

| # | Source | Effort | Expected AI call reduction | New match method |
|---|---|---|---|---|
| 1 | GLEIF | Low (one CSV ingest) | High (fixes CJK transliterations, adds LEI key) | lei |
| 2 | Wikidata | Low (batch SPARQL) | Medium (resolves well-known brands & abbreviations) | wikidata |
| 3 | OpenCorporates | Medium (license + ETL) | High (DBA coverage across 140 jurisdictions) | dba (enriched) |
| 4 | WIPO Brand DB | Medium | High for trademark-holder pipeline specifically | wipo |
| 5 | PATSTAT | High (large dataset, commercial) | High for patent-applicant pipeline specifically | patstat |

GLEIF is the single highest-ROI first step: one bulk CSV download (~1 GB compressed), a single ingest pipeline, and the LEI cross-reference immediately enables definitive matching at 0.98 confidence for any entity found, with no fuzzy scoring or AI call needed. It also adds transliterated names to the alias index, directly addressing the Chinese DBA false-negative problem observed in the correctness benchmark.
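How a `lei` match method could sit in front of the existing stages is sketched below. All names (`Match`, `make_lei_stage`, `resolve`) and the placeholder LEI are hypothetical; only the 0.98 confidence comes from this summary. The AI stage is a stub that records whether it was called.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Match:
    company_id: int
    method: str
    confidence: float

Stage = Callable[[str], Optional[Match]]

def normalise(name: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.casefold().split())

def make_lei_stage(lei_by_name: dict[str, str], company_by_lei: dict[str, int]) -> Stage:
    """Build a stage that resolves a holder via registry names -> LEI -> company."""
    def stage(holder_name: str) -> Optional[Match]:
        lei = lei_by_name.get(normalise(holder_name))
        company_id = company_by_lei.get(lei) if lei else None
        if company_id is None:
            return None
        return Match(company_id, method="lei", confidence=0.98)
    return stage

def resolve(holder_name: str, stages: list[Stage]) -> Optional[Match]:
    """Run stages in order and stop at the first hit, skipping later (costlier) stages."""
    for stage in stages:
        match = stage(holder_name)
        if match is not None:
            return match
    return None

ai_calls: list[str] = []

def ai_stage(holder_name: str) -> Optional[Match]:
    ai_calls.append(holder_name)  # stand-in for the real fuzzy/AI fallback
    return None

FAKE_LEI = "EXAMPLELEI0000000001"  # placeholder, not a real LEI
lei_stage = make_lei_stage(
    lei_by_name={normalise("雷岩"): FAKE_LEI, normalise("Leiyan"): FAKE_LEI},
    company_by_lei={FAKE_LEI: 42},
)

hit = resolve("  LEIYAN ", [lei_stage, ai_stage])
miss = resolve("Unknown Holder", [lei_stage, ai_stage])
print(hit, miss, ai_calls)
```

Only the holder that the LEI stage cannot resolve reaches the AI stub, which is the call reduction the integration table anticipates.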