WIMS Entities Mapper — Implementation Plan

API Findings

Base URL & Auth

Host: https://wims-api.winglegroup.com — PHP 7.4 / Symfony / API Platform (Hydra / JSON-LD)

Key Entity Counts

Companies

13,739

/api/companies

Brands

17,332

/api/brands

Products

78,719

/api/products

Patents

2,807,127

/api/patents

Trademarks (TM items)

4,545,489

/api/trademark_items

Trademark Holders

4,552,327

/api/trademark_holders

Patent Applicants (raw)

6,532,620

/api/patent_applicants

Unique Applicant Names

1,479,130

/api/patents/applicants

Existing Relationships (partial coverage)

The API already exposes relationship fields, but they are mostly empty — e.g. patent.companiesAmount = 0 for the vast majority of records:

✅
Company.patents / Company.trademarks — patents & trademarks linked to a company
Sparsely populated
✅
Product.manufacturers / Product.brands / Product.patents — company, brand & patent links on products
Partially populated
✅
Brand.ownerCompanies — companies owning the brand
⚠️
TrademarkItem.holders → TrademarkHolder.name — free-text only, no FK to Companies
⚠️
PatentApplicant.applicant → Applicant.name — free-text only, no FK to Companies

🔑

Core problem: TrademarkHolder.name = "Philip Morris Products S.A." and PatentApplicant.name = "Philip Morris Products Sa" both carry no foreign key to the Companies table. This pipeline's entire purpose is to bridge that gap with confidence-scored matches.

Architecture

Matching Pipeline

Step 1 Data Collection

Companies (13K) + DBAs + reg numbers + URLs
TM Holders (4.5M unique names)
Patent Applicants (1.5M unique names)
Products (79K) + brands + manufacturers
Brands (17K) + ownerCompanies

↓

Step 2 Entity Normalization

Strip legal suffixes (Inc, LLC, S.A., Ltd, GmbH…)
Lowercase + remove punctuation / diacritics
Tokenize + sort tokens
Build inverted index of companies by normalized name

↓

Step 3 Rule-based Matching confidence 0.90 – 1.0

Exact match on normalized name
DBA / alias match
Registration number match
Domain / URL match

↓

Step 4 Fuzzy Matching confidence 0.60 – 0.89

Token sort ratio ≥ 85 (fuzzball)
Longest common prefix matching
Bigram / trigram overlap score
Dedupe & rank candidates per entity

↓

Step 5 AI Relationship Scoring confidence 0.30 – 0.69

Ollama (local) — llama3.1:8b or mistral:7b
Prompt: "Are these two names the same company?"
Context injection: country, address, category
Only run on unmatched or borderline candidates

↓

Step 6 Output & Manual Review Queue

Tables: entity_matches, review_queue
JSON exports by connection type
Manual review queue for conf < 0.75

Connection Types (priority order)

TM → Company

Source: TrademarkHolder.name, country → Target: Company.name, .dbas, .registrationNumber

Patent → Company

Source: Applicant.name, country → Target: Company.name, .dbas, .registrationNumber

TM → Product

Source: TrademarkItem.name → Target: Product.name, Brand.name

Patent → Product

Source: via Company link + applicant name → Target: Product.manufacturers, Product.patents

TM → Patent

Source: shared Company + brand/product context → Target: Applicant.name ≈ Holder.name

Company → Product

Already partially exists — extend coverage

Brand Ecosystem Mapping

Source: Brand.ownerCompanies + TM name → Target: cross-entity brand mapping

Technical Stack

Runtime
Node.js v20+required; async streams for large dataset pagination
Storage
SQLite (better-sqlite3) — zero infrastructure, handles 10 M+ rows, ships everywhere
Full-text search
SQLite FTS5— built-in trigram search; replaces pg_trgm with no server required
Fuzzy matching
fuzzball (JS port of RapidFuzz) — token_sort_ratio, partial_ratio
Normalization
Custom module— no extra deps; handles legal suffixes, Unicode, transliteration
AI matching
Ollama (local LLM) — llama3.1:8b or mistral:7b; free, private, runs offline
Vector (Phase 2)
Ollama embeddings + in-memory cosine— deferred until dataset outgrows in-process approach
HTTP client
Node.js built-in https— zero dependency footprint
Config
dotenv— reads .env credentials

💡

Why not PostgreSQL / Elasticsearch? The company index is only 13,739 records — SQLite FTS5 handles all lookups in milliseconds. External services add deployment complexity with no measurable gain for a pipeline tool.

🔒

Why Ollama over cloud AI? All data stays on-premises, zero API cost, works fully offline. llama3.1:8b fits in 8 GB VRAM or runs on CPU.

Matching Logic Detail

Name Normalization

Applied uniformly to every entity name before any comparison:

"Philip Morris Products S.A."
  → lowercase:        "philip morris products s.a."
  → remove suffixes:  "philip morris products"
  → remove punct:     "philip morris products"
  → token sort:       "morris philip products"
  → stored as         normalized_name

Legal suffix strip list: inc llc corp ltd limited sa s.a. plc p.l.c. gmbh ag bv nv srl sas pty co company international group holding holdings

Confidence Scoring Model

Exact normalized match

1.00

Registration number

0.98

DBA / alias match

0.95

Fuzzy score 95 – 100

0.90

Fuzzy score 85 – 94

0.75

Fuzzy score 70 – 84

0.55

AI says "yes"

0.30 – 0.85

Multipliers applied on top of base score (capped at 1.0):

Same country match → × 1.05
Different country → × 0.90
Matched via DBA / alias → × 1.02

Filtering Strategy for Large Datasets

Rather than downloading all 4.5 M TM holders row-by-row:

1

Stream unique holder names (deduplicated in SQLite as they arrive) — reducing ~4.5 M rows to ~1 M distinct strings.
2

For each unique name, run the matching pipeline against the 13,739-company index (held in memory / FTS5).
3

Write matches to SQLite; skip non-matching holders — they belong to entities outside the WIMS company scope.

📉

This reduces the effective working set from 4.5 M raw records → ~1 M unique applicants / ~1 M unique holders with a single streaming pass.

Key Risks & Mitigations

Risk	Impact	Mitigation
API rate limiting	High — could block data collection entirely	100 ms delay between pages; exponential backoff on HTTP 429
Token expiry (1 h TTL)	Medium — pipeline breaks mid-run	Auto-refresh via `refresh_token` before each batch request
Name ambiguity (139 "Philip Morris" variants found)	High — wrong company linked	Always surface top-3 candidates with scores; human review for confidence < 0.75
Fuzzy false positives (e.g. "BAT" → wrong company)	Medium — pollutes match quality	Short names (< 5 chars) require stricter threshold (≥ 0.95)
Ollama model hallucination	Low — minor score distortion	AI used as tie-breaker only; never sole basis for a high-confidence match
Large dataset memory pressure	Low — OOM crash on weak machines	Stream and process in chunks of 1,000; write to SQLite incrementally

Summary

Total entities to match

~6 M TM holders + applicants → 13,739 companies

MVP connection types

4 (TM→Company, Patent→Company, TM→Product, Patent→Product)

Phase 2 connections

+ 3 (TM→Patent, Company→Product, Brand ecosystem)

Tech stack

Node.js + SQLite + fuzzball + Ollama

Phase 1

Rule-based matching

Phase 2

Fuzzy matching + extended connections

Phase 3

AI scoring + brand ecosystems

Phase 2 enhancements (future)

pgvector / Qdrant · Scheduled re-matching · Confidence calibration · Webhook integration