API Findings
Base URL & Auth
https://wims-api.winglegroup.com
— PHP 7.4 / Symfony / API Platform (Hydra / JSON-LD)
Key Entity Counts
Existing Relationships (partial coverage)
The API already exposes relationship fields, but they are mostly empty — e.g. patent.companiesAmount = 0 for the vast majority of records:
Company.patents/Company.trademarks— patents & trademarks linked to a companySparsely populatedProduct.manufacturers/Product.brands/Product.patents— company, brand & patent links on productsPartially populatedBrand.ownerCompanies— companies owning the brandTrademarkItem.holders → TrademarkHolder.name— free-text only, no FK to CompaniesPatentApplicant.applicant → Applicant.name— free-text only, no FK to Companies
Core problem: TrademarkHolder.name = "Philip Morris Products S.A." and PatentApplicant.name = "Philip Morris Products Sa" both carry no foreign key to the Companies table. This pipeline's entire purpose is to bridge that gap with confidence-scored matches.
Architecture
Matching Pipeline
- Companies (13K) + DBAs + reg numbers + URLs
- TM Holders (4.5M unique names)
- Patent Applicants (1.5M unique names)
- Products (79K) + brands + manufacturers
- Brands (17K) + ownerCompanies
- Strip legal suffixes (Inc, LLC, S.A., Ltd, GmbH…)
- Lowercase + remove punctuation / diacritics
- Tokenize + sort tokens
- Build inverted index of companies by normalized name
- Exact match on normalized name
- DBA / alias match
- Registration number match
- Domain / URL match
- Token sort ratio ≥ 85 (fuzzball)
- Longest common prefix matching
- Bigram / trigram overlap score
- Dedupe & rank candidates per entity
- Ollama (local) — llama3.1:8b or mistral:7b
- Prompt: "Are these two names the same company?"
- Context injection: country, address, category
- Only run on unmatched or borderline candidates
- Tables: entity_matches, review_queue
- JSON exports by connection type
- Manual review queue for conf < 0.75
Connection Types (priority order)
TrademarkHolder.name, country →
Target: Company.name, .dbas, .registrationNumber
Applicant.name, country →
Target: Company.name, .dbas, .registrationNumber
TrademarkItem.name →
Target: Product.name, Brand.name
Product.manufacturers, Product.patents
Applicant.name ≈ Holder.name
Brand.ownerCompanies + TM name →
Target: cross-entity brand mapping
Technical Stack
- RuntimeNode.js v20+required; async streams for large dataset pagination
- StorageSQLite (better-sqlite3) — zero infrastructure, handles 10 M+ rows, ships everywhere
- Full-text searchSQLite FTS5— built-in trigram search; replaces pg_trgm with no server required
- Fuzzy matchingfuzzball (JS port of RapidFuzz) — token_sort_ratio, partial_ratio
- NormalizationCustom module— no extra deps; handles legal suffixes, Unicode, transliteration
- AI matchingOllama (local LLM) — llama3.1:8b or mistral:7b; free, private, runs offline
- Vector (Phase 2)Ollama embeddings + in-memory cosine— deferred until dataset outgrows in-process approach
- HTTP clientNode.js built-in https— zero dependency footprint
- Configdotenv— reads .env credentials
Why not PostgreSQL / Elasticsearch? The company index is only 13,739 records — SQLite FTS5 handles all lookups in milliseconds. External services add deployment complexity with no measurable gain for a pipeline tool.
Why Ollama over cloud AI? All data stays on-premises, zero API cost, works fully offline. llama3.1:8b fits in 8 GB VRAM or runs on CPU.
Matching Logic Detail
Name Normalization
Applied uniformly to every entity name before any comparison:
"Philip Morris Products S.A." → lowercase: "philip morris products s.a." → remove suffixes: "philip morris products" → remove punct: "philip morris products" → token sort: "morris philip products" → stored as normalized_name
Legal suffix strip list:
inc llc corp ltd limited
sa s.a. plc p.l.c. gmbh
ag bv nv srl sas
pty co company international
group holding holdings
Confidence Scoring Model
- Same country match →
× 1.05 - Different country →
× 0.90 - Matched via DBA / alias →
× 1.02
Filtering Strategy for Large Datasets
Rather than downloading all 4.5 M TM holders row-by-row:
-
1Stream unique holder names (deduplicated in SQLite as they arrive) — reducing ~4.5 M rows to ~1 M distinct strings.
-
2For each unique name, run the matching pipeline against the 13,739-company index (held in memory / FTS5).
-
3Write matches to SQLite; skip non-matching holders — they belong to entities outside the WIMS company scope.
This reduces the effective working set from 4.5 M raw records → ~1 M unique applicants / ~1 M unique holders with a single streaming pass.
Key Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| API rate limiting | High — could block data collection entirely | 100 ms delay between pages; exponential backoff on HTTP 429 |
| Token expiry (1 h TTL) | Medium — pipeline breaks mid-run | Auto-refresh via refresh_token before each batch request |
| Name ambiguity (139 "Philip Morris" variants found) | High — wrong company linked | Always surface top-3 candidates with scores; human review for confidence < 0.75 |
| Fuzzy false positives (e.g. "BAT" → wrong company) | Medium — pollutes match quality | Short names (< 5 chars) require stricter threshold (≥ 0.95) |
| Ollama model hallucination | Low — minor score distortion | AI used as tie-breaker only; never sole basis for a high-confidence match |
| Large dataset memory pressure | Low — OOM crash on weak machines | Stream and process in chunks of 1,000; write to SQLite incrementally |
Summary
Node.js + SQLite + fuzzball + Ollama