WIMS · Research & Analysis

Entities Mapper — Implementation Plan

Rule-based + Fuzzy + AI matching pipeline  ·  TM ↔ Companies ↔ Patents ↔ Products
1

API Findings

Base URL & Auth

Host: https://wims-api.winglegroup.com   —   PHP 7.4 / Symfony / API Platform (Hydra / JSON-LD)

Key Entity Counts

Companies
13,739
/api/companies
Brands
17,332
/api/brands
Products
78,719
/api/products
Patents
2,807,127
/api/patents
Trademarks (TM items)
4,545,489
/api/trademark_items
Trademark Holders
4,552,327
/api/trademark_holders
Patent Applicants (raw)
6,532,620
/api/patent_applicants
Unique Applicant Names
1,479,130
/api/patents/applicants

Existing Relationships (partial coverage)

The API already exposes relationship fields, but they are mostly empty — e.g. patent.companiesAmount = 0 for the vast majority of records:

🔑

Core problem: TrademarkHolder.name = "Philip Morris Products S.A." and PatentApplicant.name = "Philip Morris Products Sa" both carry no foreign key to the Companies table. This pipeline's entire purpose is to bridge that gap with confidence-scored matches.

2

Architecture

Matching Pipeline

Step 1 Data Collection
  • Companies (13K) + DBAs + reg numbers + URLs
  • TM Holders (4.5M unique names)
  • Patent Applicants (1.5M unique names)
  • Products (79K) + brands + manufacturers
  • Brands (17K) + ownerCompanies
Step 2 Entity Normalization
  • Strip legal suffixes (Inc, LLC, S.A., Ltd, GmbH…)
  • Lowercase + remove punctuation / diacritics
  • Tokenize + sort tokens
  • Build inverted index of companies by normalized name
Step 3 Rule-based Matching confidence 0.90 – 1.0
  • Exact match on normalized name
  • DBA / alias match
  • Registration number match
  • Domain / URL match
Step 4 Fuzzy Matching confidence 0.60 – 0.89
  • Token sort ratio ≥ 85 (fuzzball)
  • Longest common prefix matching
  • Bigram / trigram overlap score
  • Dedupe & rank candidates per entity
Step 5 AI Relationship Scoring confidence 0.30 – 0.69
  • Ollama (local) — llama3.1:8b or mistral:7b
  • Prompt: "Are these two names the same company?"
  • Context injection: country, address, category
  • Only run on unmatched or borderline candidates
Step 6 Output & Manual Review Queue
  • Tables: entity_matches, review_queue
  • JSON exports by connection type
  • Manual review queue for conf < 0.75

Connection Types (priority order)

1
TM → Company
Source: TrademarkHolder.name, country  →  Target: Company.name, .dbas, .registrationNumber
2
Patent → Company
Source: Applicant.name, country  →  Target: Company.name, .dbas, .registrationNumber
3
TM → Product
Source: TrademarkItem.name  →  Target: Product.name, Brand.name
4
Patent → Product
Source: via Company link + applicant name  →  Target: Product.manufacturers, Product.patents
5
TM → Patent
Source: shared Company + brand/product context  →  Target: Applicant.nameHolder.name
6
Company → Product
Already partially exists — extend coverage
7
Brand Ecosystem Mapping
Source: Brand.ownerCompanies + TM name  →  Target: cross-entity brand mapping
3

Technical Stack

💡

Why not PostgreSQL / Elasticsearch? The company index is only 13,739 records — SQLite FTS5 handles all lookups in milliseconds. External services add deployment complexity with no measurable gain for a pipeline tool.

🔒

Why Ollama over cloud AI? All data stays on-premises, zero API cost, works fully offline. llama3.1:8b fits in 8 GB VRAM or runs on CPU.

4

Matching Logic Detail

Name Normalization

Applied uniformly to every entity name before any comparison:

"Philip Morris Products S.A."
  → lowercase:        "philip morris products s.a."
  → remove suffixes:  "philip morris products"
  → remove punct:     "philip morris products"
  → token sort:       "morris philip products"
  → stored as         normalized_name

Legal suffix strip list: inc llc corp ltd limited sa s.a. plc p.l.c. gmbh ag bv nv srl sas pty co company international group holding holdings

Confidence Scoring Model

Exact normalized match
1.00
Registration number
0.98
DBA / alias match
0.95
Fuzzy score 95 – 100
0.90
Fuzzy score 85 – 94
0.75
Fuzzy score 70 – 84
0.55
AI says "yes"
0.30 – 0.85
Multipliers applied on top of base score (capped at 1.0):
  • Same country match  →  × 1.05
  • Different country  →  × 0.90
  • Matched via DBA / alias  →  × 1.02

Filtering Strategy for Large Datasets

Rather than downloading all 4.5 M TM holders row-by-row:

  1. 1
    Stream unique holder names (deduplicated in SQLite as they arrive) — reducing ~4.5 M rows to ~1 M distinct strings.
  2. 2
    For each unique name, run the matching pipeline against the 13,739-company index (held in memory / FTS5).
  3. 3
    Write matches to SQLite; skip non-matching holders — they belong to entities outside the WIMS company scope.
📉

This reduces the effective working set from 4.5 M raw records → ~1 M unique applicants / ~1 M unique holders with a single streaming pass.

5

Key Risks & Mitigations

Risk Impact Mitigation
API rate limiting High — could block data collection entirely 100 ms delay between pages; exponential backoff on HTTP 429
Token expiry (1 h TTL) Medium — pipeline breaks mid-run Auto-refresh via refresh_token before each batch request
Name ambiguity (139 "Philip Morris" variants found) High — wrong company linked Always surface top-3 candidates with scores; human review for confidence < 0.75
Fuzzy false positives (e.g. "BAT" → wrong company) Medium — pollutes match quality Short names (< 5 chars) require stricter threshold (≥ 0.95)
Ollama model hallucination Low — minor score distortion AI used as tie-breaker only; never sole basis for a high-confidence match
Large dataset memory pressure Low — OOM crash on weak machines Stream and process in chunks of 1,000; write to SQLite incrementally
6

Summary

Total entities to match
~6 M TM holders + applicants → 13,739 companies
MVP connection types
4  (TM→Company, Patent→Company, TM→Product, Patent→Product)
Phase 2 connections
+ 3  (TM→Patent, Company→Product, Brand ecosystem)
Tech stack
Node.js + SQLite + fuzzball + Ollama
Phase 1
Rule-based matching
Phase 2
Fuzzy matching + extended connections
Phase 3
AI scoring + brand ecosystems
Phase 2 enhancements (future)
pgvector / Qdrant · Scheduled re-matching · Confidence calibration · Webhook integration