Based on dry-run benchmark: 1,000 trademark holders, AI fallback enabled
The AI fallback (Ollama llama3.1 8B Q4_K_M) is responsible for 85% of all successful matches (475 of 559), which confirms it is essential to the pipeline, not supplementary. Without AI, the pipeline would have matched only 84 holders (8.4%) instead of 559 (55.9%).
The 441 no-match holders (44%) represent the ceiling for improvement. Most either have no name-token overlap with any company in the index (FTS returns nothing, so AI is never invoked), or the names are too semantically distant for current models. External data sources are the primary lever to shrink this group.
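The headline percentages follow directly from the three benchmark counts. A minimal sketch of the arithmetic, using only the figures quoted above (the variable names are illustrative):

```python
# Counts taken from the dry-run benchmark described above.
total_holders = 1_000
matched_with_ai = 559      # successful matches with the AI fallback enabled
matched_without_ai = 84    # matches the non-AI stages would produce alone

ai_matches = matched_with_ai - matched_without_ai   # matches that needed AI
ai_share = ai_matches / matched_with_ai             # AI share of all successes
no_match = total_holders - matched_with_ai          # holders left unmatched

print(f"AI-only matches: {ai_matches} ({ai_share:.0%} of successes)")
print(f"Match rate without AI: {matched_without_ai / total_holders:.1%}")
print(f"Match rate with AI: {matched_with_ai / total_holders:.1%}")
print(f"No-match holders: {no_match} ({no_match / total_holders:.1%})")
```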
Running Ollama on CPU eliminates the GPU requirement at the cost of significantly lower token generation throughput. The model (llama3.1 8B Q4_K_M, ~4.5 GB) fits in system RAM on any instance with ≥ 8 GB, but inference speed is bottlenecked by the memory bus rather than compute.
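One way to see why memory bandwidth dominates: if generating each token requires streaming roughly all model weights through memory once, throughput cannot exceed bandwidth divided by model size. The sketch below applies that rule of thumb. The bandwidth values are assumptions for illustration (the RTX 4060 figure is its published spec; the CPU figure is a hypothetical placeholder, not a measured n2 value).

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed when every token streams all weights once."""
    if bandwidth_gb_s <= 0 or model_size_gb <= 0:
        raise ValueError("bandwidth and model size must be positive")
    return bandwidth_gb_s / model_size_gb


MODEL_SIZE_GB = 4.5  # llama3.1 8B Q4_K_M, as stated above

# Bandwidth figures are assumptions for illustration, not benchmark results.
ASSUMED_BANDWIDTH_GB_S = {
    "RTX 4060 (272 GB/s, published spec)": 272.0,
    "CPU host (hypothetical 40 GB/s)": 40.0,
}

for name, bandwidth in ASSUMED_BANDWIDTH_GB_S.items():
    bound = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{bound:.0f} tokens/s")
```

Under these assumptions the RTX 4060 bound comes out near 60 tokens/s, which is consistent with the measured ~55 tokens/s; real throughput always lands below the bound.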
| Hardware | Tokens/s | Slowdown vs RTX 4060 | Rate (holders/min) | Est. duration for 140K holders |
|---|---|---|---|---|
| RTX 4060 (current) | ~55 | 1× | 17.2 | ~137 h |
| n2-standard-8 (CPU, AVX2) | ~6–8 | ~8–9× | ~2.0 | ~1,174 h |
| n2-standard-4 (CPU, AVX2) | ~4–6 | ~10–12× | ~1.5 | ~1,566 h |
| Older CPU (no AVX2) | < 2 | > 25× | < 0.7 | > 3,300 h |
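The duration column can be approximated from the per-minute rate. A small sketch, assuming a backlog of exactly 140,000 holders (the table's figures are rounded and come out within roughly 1% of these results):

```python
BACKLOG_HOLDERS = 140_000  # backlog size quoted in the table header


def estimated_hours(rate_per_min: float, holders: int = BACKLOG_HOLDERS) -> float:
    """Hours needed to process `holders` at `rate_per_min` holders per minute."""
    if rate_per_min <= 0:
        raise ValueError("rate_per_min must be positive")
    return holders / rate_per_min / 60


# Rates (holders/min) copied from the table above.
rates = {"RTX 4060": 17.2, "n2-standard-8": 2.0, "n2-standard-4": 1.5}
for hardware, rate in rates.items():
    print(f"{hardware}: ~{estimated_hours(rate):,.0f} h")
```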

CPU-only instance costs for the 140K-holder backlog:

| Instance | vCPU | RAM | On-demand /hr | Preemptible /hr | Duration | On-demand total | Preemptible total |
|---|---|---|---|---|---|---|---|
| n2-standard-4 | 4 | 16 GB | $0.19 | $0.05 | ~1,566 h | $297 | $78 |
| n2-standard-8 | 8 | 32 GB | $0.39 | $0.10 | ~1,174 h | $457 | $117 |
| c2-standard-8 (compute-opt.) | 8 | 32 GB | $0.42 | $0.11 | ~1,000 h | $420 | $110 |

GPU instance costs for the 140K-holder backlog:

| Instance | GPU | VRAM | On-demand /hr | Spot /hr | Duration | On-demand total | Spot total |
|---|---|---|---|---|---|---|---|
| n1-standard-4 + T4 | T4 | 16 GB | $0.35 | $0.10 | ~118 h | $41 | $12 |
| g2-standard-4 + L4 | L4 | 24 GB | $0.70 | $0.18 | ~78 h | $55 | $14 |
| a2-highgpu-1g + A100 | A100 | 40 GB | $2.93 | $0.88 | ~35 h | $103 | $31 |

An n1-standard-4 + T4 spot instance costs ~$12
and finishes in about 5 days, versus $78–117 and 49–65 days for preemptible CPU-only instances.
The GPU option is both cheaper and roughly 10–13× faster, so CPU-only is justified only when GPU quota is unavailable.
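The comparison above is hourly price multiplied by estimated duration. A short sketch that reproduces the discounted totals and converts hours to days, using values copied from the two cost tables (the function names are illustrative):

```python
# (hourly price in USD, estimated duration in hours), copied from the tables above.
OPTIONS = {
    "n2-standard-4 preemptible": (0.05, 1566),
    "n2-standard-8 preemptible": (0.10, 1174),
    "n1-standard-4 + T4 spot": (0.10, 118),
}


def total_cost(hourly_usd: float, hours: float) -> float:
    """Total instance cost for a run of the given length."""
    return hourly_usd * hours


def days(hours: float) -> float:
    """Convert a duration in hours to days of continuous running."""
    return hours / 24


for name, (price, hours) in OPTIONS.items():
    print(f"{name}: ${total_cost(price, hours):.0f}, {days(hours):.1f} days")
```

These figures assume uninterrupted running; preemptions add wall-clock time but are not billed while the instance is stopped.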
If GPU instances are unavailable, use n2-standard-8 preemptible with checkpointing
(the pipeline already supports resume via MAX(source_id)). Run multiple instances
across ID-range shards to parallelise. Expect ~$117 total for the current 140K backlog,
or ~$800–1,200 if the full API stream adds another ~1M holders.
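The sharding and resume approach can be sketched as follows. This is an illustration in Python rather than the pipeline's own Node.js code. The `matches` table and its columns are hypothetical, and it assumes each worker processes its shard in ascending `source_id` order, so that `MAX(source_id)` within the shard marks the checkpoint.

```python
import sqlite3


def make_shards(min_id: int, max_id: int, n_shards: int) -> list[tuple[int, int]]:
    """Split the inclusive ID range [min_id, max_id] into contiguous shards."""
    total = max_id - min_id + 1
    base, extra = divmod(total, n_shards)
    shards, start = [], min_id
    for i in range(n_shards):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more shards requested than IDs available
        shards.append((start, start + size - 1))
        start += size
    return shards


def resume_point(conn: sqlite3.Connection, shard: tuple[int, int]) -> int:
    """Return the next source_id to process within a shard."""
    lo, hi = shard
    row = conn.execute(
        "SELECT MAX(source_id) FROM matches WHERE source_id BETWEEN ? AND ?",
        (lo, hi),
    ).fetchone()
    last_done = row[0]  # None when the shard has no processed rows yet
    return lo if last_done is None else last_done + 1


# Demo with an in-memory database and a hypothetical results table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (source_id INTEGER PRIMARY KEY, company_id INTEGER)")
conn.executemany(
    "INSERT INTO matches VALUES (?, ?)",
    [(i, None) for i in list(range(1, 11)) + [35001, 35002, 35003]],
)

shards = make_shards(1, 140_000, 4)
for shard in shards:
    print(shard, "resume at", resume_point(conn, shard))
```

Each preempted worker can call `resume_point` for its own shard on restart, so shards stay independent and no cross-instance coordination is needed.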
Minimum viable instance: n2-standard-4 (4 vCPU, 16 GB). The model occupies ~5 GB;
the remaining 11 GB comfortably handles the OS, SQLite DB, and Node.js pipeline.
AVX2 is effectively required (without it, throughput drops below 2 tokens/s); all n2 instances support it.
Most of the 441 no-match holders (44% of the dry run) never reach the AI stage because FTS returns no candidates: the name shares no token with any indexed company. External registries can bridge this gap by adding officially registered alternate names, transliterations, and cross-reference identifiers directly to the company index.
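The candidate-generation gap can be illustrated with a simplified token-overlap check. This is a stand-in for the FTS step, not the pipeline's actual tokenizer, and the company names are hypothetical examples. With no shared token there is no candidate for the AI to judge; adding a transliterated alias to the index creates one.

```python
import re


def tokens(name: str) -> set[str]:
    """Lowercased word tokens; a simplified stand-in for FTS tokenization."""
    return set(re.findall(r"\w+", name.lower()))


def has_candidate(holder: str, indexed_names: list[str]) -> bool:
    """True if the holder shares at least one token with any indexed name."""
    holder_tokens = tokens(holder)
    return any(holder_tokens & tokens(name) for name in indexed_names)


holder = "Leiyan Technology Co Ltd"      # hypothetical trademark holder
index_before = ["雷岩科技有限公司"]        # hypothetical indexed legal name
index_after = index_before + ["Leiyan Technology Co., Ltd."]  # alias added

print(has_candidate(holder, index_before))  # False: no shared token
print(has_candidate(holder, index_after))   # True: the alias shares tokens
```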
GLEIF maintains the authoritative global registry for Legal Entity Identifiers (LEIs). Each record contains the official legal name, a transliterated name in Roman script (covers CJK, Cyrillic, Arabic), alternative business names, registration number, country, and LEI code.
Why it matters: Directly resolves Chinese transliteration mismatches (e.g. "雷岩" → "Leiyan") that caused DBA false negatives in the benchmark. The LEI can be stored as a cross-reference key, enabling definitive matching at 0.98 confidence and bypassing the fuzzy and AI stages entirely.
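One possible reading of the LEI cross-reference is a two-step exact lookup that runs before the fuzzy and AI stages. The sketch below assumes two hypothetical in-memory maps: registry names (legal, transliterated, alternative) to LEI, and LEI to internal company ID. The LEI value is a made-up placeholder, not a real identifier.

```python
from typing import Optional

LEI_CONFIDENCE = 0.98  # confidence level quoted in the text above


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so lookups are exact but tolerant."""
    return " ".join(name.casefold().split())


def match_via_lei(
    holder_name: str,
    lei_by_name: dict[str, str],
    company_by_lei: dict[str, int],
) -> Optional[dict]:
    """Resolve holder name -> LEI -> company; None means fall through."""
    lei = lei_by_name.get(normalize(holder_name))
    if lei is None:
        return None
    company_id = company_by_lei.get(lei)
    if company_id is None:
        return None
    return {"company_id": company_id, "method": "lei", "confidence": LEI_CONFIDENCE}


PLACEHOLDER_LEI = "EXAMPLE0000000000001"  # hypothetical, not a real LEI
lei_by_name = {normalize("雷岩"): PLACEHOLDER_LEI, normalize("Leiyan"): PLACEHOLDER_LEI}
company_by_lei = {PLACEHOLDER_LEI: 42}    # 42 is a hypothetical company ID

print(match_via_lei("LEIYAN", lei_by_name, company_by_lei))
print(match_via_lei("Unknown Holder", lei_by_name, company_by_lei))
```

Returning `None` on a miss lets the existing fuzzy and AI stages handle everything the registry does not cover.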
Wikidata is a structured knowledge base with ~1M company entities. Properties include official name (P1448), aliases in 50+ languages, industry (P452), country (P17), official website, and links to national registries.
Why it matters: Strong coverage of brands, multinationals, and publicly-traded companies — exactly the entities that appear frequently as trademark holders. Alias lists resolve abbreviations and brand-name vs legal-name discrepancies.
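A batch lookup of official names and aliases could be expressed as a single SPARQL query over a list of entity IDs. The sketch below only builds the query string and sends nothing. It relies on the `wd:`, `wdt:`, and `skos:` prefixes that the Wikidata Query Service predefines; the function name and example IDs are illustrative.

```python
import re

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public query service
_QID = re.compile(r"Q[1-9][0-9]*")


def build_alias_query(qids: list[str]) -> str:
    """Build a SPARQL query for official names (P1448) and aliases."""
    if not qids:
        raise ValueError("at least one entity ID is required")
    for qid in qids:
        if not _QID.fullmatch(qid):
            raise ValueError(f"not a Wikidata entity ID: {qid!r}")
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?officialName ?alias WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  OPTIONAL { ?item wdt:P1448 ?officialName . }\n"
        "  OPTIONAL { ?item skos:altLabel ?alias . }\n"
        "}"
    )


# Example entity IDs; substitute the IDs of the companies of interest.
print(build_alias_query(["Q95", "Q312"]))
```

Because both patterns are optional, each item yields one row per combination of official name and alias, so results need grouping by `?item` before they are merged into the alias index.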
OpenCorporates is the largest open database of company registrations, covering 140 jurisdictions. It includes official company name, registration number, jurisdiction, status, officers, and, crucially, alternative/trading names (DBAs) per filing.
Why it matters: DBA data is the primary gap in the current alias index.
OpenCorporates bulk data would dramatically expand the aliases column
with officially-registered trading names, reducing AI calls for DBA pairs.
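Expanding the aliases column amounts to a deduplicating merge. A minimal sketch, assuming the column stores a JSON array of strings (an assumption; the actual storage format is not specified here) and that duplicates are detected case-insensitively:

```python
import json
from typing import Optional


def merge_aliases(existing_json: Optional[str], trading_names: list[str]) -> str:
    """Append new trading names to a JSON alias list, skipping duplicates."""
    aliases = json.loads(existing_json) if existing_json else []
    seen = {alias.strip().casefold() for alias in aliases}
    for name in trading_names:
        cleaned = name.strip()
        key = cleaned.casefold()
        if cleaned and key not in seen:
            aliases.append(cleaned)
            seen.add(key)
    return json.dumps(aliases, ensure_ascii=False)


# Hypothetical company with one existing alias and three incoming DBA names.
merged = merge_aliases('["Acme Holdings"]', ["ACME holdings", " Acme Tools ", ""])
print(merged)
```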
The WIPO Brand DB provides trademark holder names normalized across 70+ national offices. Because WIPO already deduplicates holder names across filings, this is a pre-solved version of the exact problem the pipeline is tackling: mapping a holder name to a legal entity.
Full US trademark holder names with addresses and correspondent data. Cross-filing analysis links holders across applications, revealing parent/subsidiary relationships not visible in single-filing data.
Patent applicant names with EPO-standardized deduplication
(person_name in Docdb tables). The EPO has already resolved
many of the same name-variant ambiguities present in the patent applicant pipeline.
PATSTAT Online offers SQL access; bulk download is available commercially.
| Registry | Jurisdiction | Access | Key data |
|---|---|---|---|
| Companies House | UK | Free API | Official name, reg. number, previous names, SIC codes |
| SEC EDGAR | US | Free bulk | Legal entity name, CIK, SIC, state of incorporation |
| Sirene (INSEE) | France | Free bulk | SIRET/SIREN, legal name, NAF code |
| Handelsregister | Germany | Limited API | HRB number, legal name, registered seat |
| ARCA / Rol | Romania | Free | CUI, legal name, status |

Suggested integration order:

| # | Source | Effort | Expected AI call reduction | New match method |
|---|---|---|---|---|
| 1 | GLEIF | Low — one CSV ingest | High — fixes CJK transliterations + adds LEI key | lei |
| 2 | Wikidata | Low — batch SPARQL | Medium — resolves well-known brands & abbreviations | wikidata |
| 3 | OpenCorporates | Medium — license + ETL | High — DBA coverage across 140 jurisdictions | dba (enriched) |
| 4 | WIPO Brand DB | Medium | High for trademark-holder pipeline specifically | wipo |
| 5 | PATSTAT | High — large dataset, commercial | High for patent-applicant pipeline specifically | patstat |