WIMS Entities Mapper

AI Match Service — Productivity & Correctness

Local Ollama inference benchmark and correctness evaluation

Models llama3.1:latest · mistral:latest
Runtime Ollama (CPU, local)
Date 2026-05-30
Endpoint /api/chat · temperature 0

Productivity

llama3.1 — p50 latency
757 ms
mean 771 ms · p90 895 ms
mistral — p50 latency
1,023 ms
mean 1,082 ms · p90 1,465 ms
llama3.1 — throughput
~1.3 calls/s
~53 tok/s generation
mistral — throughput
~0.9 calls/s
~55 tok/s generation
Model p50 p90 Mean Prompt eval Generation Tok/s Est. / 1k calls
llama3.1:latest 757 ms 895 ms 771 ms ~33 ms ~658 ms 53 ~13 min
mistral:latest 1,023 ms 1,465 ms 1,082 ms ~37 ms ~965 ms 55 ~18 min
Concurrency does not help. Ollama serializes requests internally — p50 latency scales linearly with concurrency (×1→×2→×4) while wall-clock throughput stays flat at ~1.3 calls/s. The only path to higher throughput is OLLAMA_NUM_PARALLEL on the server, or running a second Ollama instance.
Practical ceiling: ~49k fuzzy match candidates in the database would take roughly 10.5 hours to re-score with llama3.1 sequentially (~15 h with mistral). AI scoring should be reserved for borderline fuzzy candidates (confidence 0.675–0.75), not applied to exact or high-confidence matches.
🎯

Correctness

Evaluated against a labeled set of 58 annotated pairs drawn from existing entity_matches — exact, dba, and high-confidence fuzzy positives plus easy and hard synthetic negatives (same-country, shared-token company pairs). An additional 20 unlabeled low-confidence fuzzy pairs were included for observation. Best configuration: country context added to every prompt.

Overall results — best configuration (+ country context)

Model Accuracy Precision Recall F1 TP TN FP FN
mistral:latest Best 89.7% 89.3% 89.3% 89.3% 252733
llama3.1:latest 75.9% 66.7% 100% 80.0% 2816140

Per-tier breakdown

Tier N Description llama3.1 acc mistral acc
exact 15 Same normalized name 100% 100%
dba 7 Known aliases, conf ≥ 0.95 100% 57%
fuzzy-high 6 Fuzzy match, conf ≥ 0.85 100% 100%
neg-easy 15 Different country, unrelated name 100% 100%
neg-hard 15 Same country, shared token, different entity 6.7% 80%

Context improvements

Configuration llama3.1 F1 mistral F1 Notes
No context 78.9% 80.7% Baseline
+ Country context Best 80.0% 89.3% +8.6pp mistral; country mismatch note in prompt
+ Country + Categories 78.9% 84.2% Categories hurt: only 3 coarse domain values
🔬

Model Profiles

llama3.1:latest — Aggressive matcher

Recall = 100%, zero false negatives. Never misses a real match. However it produces a high false-positive rate on hard negatives — it treats structural similarity ("Shenzhen X Technology Co." ≈ "Shenzhen Y Technology Co.") as evidence of a match, relying on city prefix and entity-type suffix rather than the actual company name. Country context has minimal effect on same-country pairs.

Best used when downstream review will catch false positives, or when missing a true match is more costly than surfacing a false one.

mistral:latest — Calibrated matcher

F1 = 89.3% with country context. Balanced precision and recall. Its reasoning explicitly cites country mismatches and structural differences when rejecting pairs, making decisions auditable. Country context improved its F1 by +8.6pp — the most responsive model to prompt enrichment. Remaining errors are mostly on DBA pairs where the company names are visibly unrelated (holding-company ↔ brand-name style entries).

Best used for production scoring where precision matters and results feed directly into the database without manual review.

Recommendation

Use mistral for production AI scoring on fuzzy match candidates where country data is available. It delivers the best precision/recall balance at F1 = 89.3%. Always pass source_country and company_country in context. Omit category context until the dataset has sub-industry granularity beyond the current 3 broad domain values.
Scope AI to borderline cases only. Exact and high-confidence DBA/fuzzy matches (conf ≥ 0.85) are already correct — running them through AI adds latency with no accuracy gain. Target the ~22,748 fuzzy matches at confidence 0.675 where AI filtering would have the highest impact.
Sequential throughput is the binding constraint. At ~0.9 calls/s, scoring all 22k borderline candidates would take ~7 hours. Plan for an overnight batch run, or configure OLLAMA_NUM_PARALLEL to increase server-side concurrency.