AI Match Service — Productivity & Correctness

⚡

Productivity

llama3.1 — p50 latency

757 ms

mean 771 ms · p90 895 ms

mistral — p50 latency

1,023 ms

mean 1,082 ms · p90 1,465 ms

llama3.1 — throughput

~1.3 calls/s

~53 tok/s generation

mistral — throughput

~0.9 calls/s

~55 tok/s generation

Model	p50	p90	Mean	Prompt eval	Generation	Tok/s	Est. / 1k calls
llama3.1:latest	757 ms	895 ms	771 ms	~33 ms	~658 ms	53	~13 min
mistral:latest	1,023 ms	1,465 ms	1,082 ms	~37 ms	~965 ms	55	~18 min
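The throughput and "Est. / 1k calls" figures appear to follow directly from the mean latency of back-to-back calls. A minimal sketch of that arithmetic, assuming one request in flight at a time; the function names are illustrative and not part of the service:

```python
def calls_per_second(mean_latency_ms: float) -> float:
    """Sequential throughput: one call completes every mean_latency_ms."""
    return 1000.0 / mean_latency_ms


def minutes_per_1k_calls(mean_latency_ms: float) -> float:
    """Wall-clock minutes needed to run 1,000 calls back to back."""
    total_seconds = 1000 * mean_latency_ms / 1000.0
    return total_seconds / 60.0


# Mean latencies copied from the table above.
MEAN_LATENCY_MS = {"llama3.1:latest": 771, "mistral:latest": 1082}

for model, mean_ms in MEAN_LATENCY_MS.items():
    print(
        f"{model}: {calls_per_second(mean_ms):.1f} calls/s, "
        f"~{minutes_per_1k_calls(mean_ms):.0f} min per 1k calls"
    )
```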

Client-side concurrency does not help. In this setup Ollama serializes requests internally: p50 latency scales linearly with concurrency (×1→×2→×4) while wall-clock throughput stays flat at ~1.3 calls/s (llama3.1). The only paths to higher throughput are raising OLLAMA_NUM_PARALLEL on the server or running a second Ollama instance.
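The flat-throughput observation matches what an idealized one-at-a-time server would produce. The sketch below is a model, not a measurement: it assumes a fixed service time (the llama3.1 mean of 771 ms is used as a stand-in) and that each client keeps exactly one request outstanding.

```python
def serialized_load(service_ms: float, concurrency: int) -> tuple[float, float]:
    """Idealized model of a server that processes one request at a time.

    With `concurrency` clients each keeping one request outstanding, every
    request queues behind the others, so per-request latency is roughly
    concurrency * service_ms while completed calls per second stay constant.
    """
    latency_ms = service_ms * concurrency
    throughput = concurrency / (latency_ms / 1000.0)
    return latency_ms, throughput


for clients in (1, 2, 4):
    latency_ms, throughput = serialized_load(771, clients)
    print(f"x{clients}: latency ~{latency_ms:.0f} ms, {throughput:.2f} calls/s")
```

Under this model the latency doubles with each step while throughput stays at about 1.3 calls/s, which is the pattern described above.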

Practical ceiling: ~49k fuzzy match candidates in the database would take roughly 10.5 hours to re-score with llama3.1 sequentially (~15 h with mistral). AI scoring should be reserved for borderline fuzzy candidates (confidence 0.675–0.75), not applied to exact or high-confidence matches.
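The batch-duration estimates can be reproduced from the mean latencies. A rough sketch, assuming strictly sequential calls with no retries or overhead between them; `sequential_hours` is a hypothetical helper name:

```python
def sequential_hours(n_candidates: int, mean_latency_ms: float) -> float:
    """Hours needed to score n_candidates one after another."""
    return n_candidates * mean_latency_ms / 1000.0 / 3600.0


# ~49k fuzzy candidates, mean latencies from the productivity table.
print(f"llama3.1: {sequential_hours(49_000, 771):.1f} h")
print(f"mistral:  {sequential_hours(49_000, 1082):.1f} h")

# The 22,748 borderline candidates cited in the recommendation, with mistral.
print(f"mistral, borderline only: {sequential_hours(22_748, 1082):.1f} h")
```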

🎯

Correctness

Evaluated against a labeled set of 58 pairs drawn from existing entity_matches: exact, dba, and high-confidence fuzzy positives (28 pairs), plus easy and hard synthetic negatives (30 pairs; the hard negatives are same-country company pairs that share a token). An additional 20 unlabeled low-confidence fuzzy pairs were included for observation only and do not count toward the metrics below. Best configuration: country context added to every prompt.

Overall results — best configuration (+ country context)

Model	Accuracy	Precision	Recall	F1	TP	TN	FP	FN
mistral:latest (best)	89.7%	89.3%	89.3%	89.3%	25	27	3	3
llama3.1:latest	75.9%	66.7%	100%	80.0%	28	16	14	0
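The percentages in this table can be recomputed from the TP/TN/FP/FN counts using the standard definitions. A small sketch; the `Confusion` class is illustrative and not taken from the service code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int
    tn: int
    fp: int
    fn: int

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r)


# Counts copied from the table above.
RESULTS = {
    "mistral:latest": Confusion(tp=25, tn=27, fp=3, fn=3),
    "llama3.1:latest": Confusion(tp=28, tn=16, fp=14, fn=0),
}

for model, c in RESULTS.items():
    print(
        f"{model}: acc={c.accuracy:.1%} prec={c.precision:.1%} "
        f"rec={c.recall:.1%} f1={c.f1:.1%}"
    )
```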

Per-tier breakdown

Tier	N	Description	llama3.1 acc	mistral acc
exact	15	Same normalized name	100%	100%
dba	7	Known aliases, conf ≥ 0.95	100%	57%
fuzzy-high	6	Fuzzy match, conf ≥ 0.85	100%	100%
neg-easy	15	Different country, unrelated name	100%	100%
neg-hard	15	Same country, shared token, different entity	6.7%	80%
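The per-tier accuracies reconcile with the overall confusion counts. The sketch below rebuilds TP/TN/FP/FN from the tier table, assuming each reported percentage is a rounded fraction of that tier's N; the data layout and function name are hypothetical.

```python
# (tier, N, is_positive, llama3.1 accuracy %, mistral accuracy %)
TIERS = [
    ("exact", 15, True, 100.0, 100.0),
    ("dba", 7, True, 100.0, 57.0),
    ("fuzzy-high", 6, True, 100.0, 100.0),
    ("neg-easy", 15, False, 100.0, 100.0),
    ("neg-hard", 15, False, 6.7, 80.0),
]

LLAMA, MISTRAL = 3, 4  # column indexes into each TIERS row


def confusion_from_tiers(acc_index: int) -> dict[str, int]:
    """Rebuild confusion counts from per-tier accuracy percentages."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for row in TIERS:
        n, is_positive = row[1], row[2]
        correct = round(n * row[acc_index] / 100.0)
        wrong = n - correct
        if is_positive:
            counts["TP"] += correct  # true match, predicted match
            counts["FN"] += wrong    # true match, predicted no match
        else:
            counts["TN"] += correct  # non-match, predicted no match
            counts["FP"] += wrong    # non-match, predicted match
    return counts


print("llama3.1:", confusion_from_tiers(LLAMA))
print("mistral: ", confusion_from_tiers(MISTRAL))
```

Read this way, all 14 llama3.1 false positives come from the neg-hard tier, while mistral's six errors are three DBA false negatives and three neg-hard false positives.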

Context improvements

Configuration	llama3.1 F1	mistral F1	Notes
No context	78.9%	80.7%	Baseline
+ Country context (best)	80.0%	89.3%	+8.6pp mistral F1; country mismatch note added to prompt
+ Country + Categories	78.9%	84.2%	Categories hurt: only 3 coarse domain values
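As an illustration of what "country context" with a mismatch note could look like, here is a hedged sketch of a prompt builder. The prompt wording, the function name, and the parameter names are assumptions for illustration; the actual prompt used in the evaluation is not shown in this report.

```python
from typing import Optional


def build_match_prompt(
    source_name: str,
    company_name: str,
    source_country: Optional[str] = None,
    company_country: Optional[str] = None,
) -> str:
    """Assemble a match prompt, adding country context when both sides have it."""
    lines = [
        "Do these two names refer to the same company?",
        "Answer YES or NO, followed by a one-sentence reason.",
        f"Name A: {source_name}",
        f"Name B: {company_name}",
    ]
    if source_country and company_country:
        lines.append(f"Country A: {source_country}")
        lines.append(f"Country B: {company_country}")
        # Explicit mismatch note (hypothetical wording).
        if source_country.strip().upper() != company_country.strip().upper():
            lines.append("Note: the two records are in different countries.")
    return "\n".join(lines)


print(build_match_prompt("Acme Trading Ltd", "Acme Trading GmbH", "GB", "DE"))
```

Category context is deliberately left out of this sketch, in line with the finding that the three coarse domain values lowered F1.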

🔬

Model Profiles

llama3.1:latest — Aggressive matcher

Recall = 100%, zero false negatives: it never misses a real match in this set. However, it misclassifies 14 of the 15 hard negatives as matches. It treats structural similarity ("Shenzhen X Technology Co." ≈ "Shenzhen Y Technology Co.") as evidence of a match, relying on the city prefix and entity-type suffix rather than the distinguishing company name. Country context has minimal effect on same-country pairs.

Best used when downstream review will catch false positives, or when missing a true match is more costly than surfacing a false one.

mistral:latest — Calibrated matcher

F1 = 89.3% with country context, with balanced precision and recall. Its reasoning explicitly cites country mismatches and structural differences when rejecting pairs, which makes decisions auditable. Country context improved its F1 by 8.6pp (80.7% to 89.3%), making it the model most responsive to prompt enrichment. The remaining six errors split evenly: three false negatives on DBA pairs where the company names are visibly unrelated (holding-company ↔ brand-name style entries), and three false positives on hard negatives.

Best used for production scoring where precision matters and results feed directly into the database without manual review.

✅

Recommendation

Use mistral for production AI scoring on fuzzy match candidates where country data is available. It delivers the best precision/recall balance at F1 = 89.3%. Always pass source_country and company_country in context. Omit category context until the dataset has sub-industry granularity beyond the current 3 broad domain values.

Scope AI to borderline cases only. Exact and high-confidence DBA/fuzzy matches (conf ≥ 0.85) are already correct, so running them through AI adds latency with no accuracy gain; in the evaluation, mistral rejected 3 of the 7 known DBA pairs. Target the ~22,748 fuzzy matches at confidence 0.675, where AI filtering would have the highest impact.
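One way to express this scoping rule is a small routing predicate. This is a sketch under assumptions: the `match_type` labels are hypothetical, and the upper bound of 0.75 is taken from the borderline range quoted in the productivity section.

```python
BORDERLINE_LOW = 0.675
BORDERLINE_HIGH = 0.75  # upper bound from the "Practical ceiling" note


def needs_ai_scoring(match_type: str, confidence: float) -> bool:
    """Return True only for borderline fuzzy candidates.

    Exact and DBA matches, and fuzzy matches outside the borderline band,
    skip the model entirely.
    """
    if match_type != "fuzzy":
        return False
    return BORDERLINE_LOW <= confidence <= BORDERLINE_HIGH


candidates = [
    ("exact", 1.0),
    ("dba", 0.95),
    ("fuzzy", 0.90),
    ("fuzzy", 0.70),
    ("fuzzy", 0.675),
]
to_score = [c for c in candidates if needs_ai_scoring(*c)]
print(to_score)
```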

Sequential throughput is the binding constraint. At ~0.9 calls/s, scoring all 22k borderline candidates would take ~7 hours. Plan for an overnight batch run, or configure OLLAMA_NUM_PARALLEL to increase server-side concurrency.