Productivity
| Model | p50 | p90 | Mean | Prompt eval | Generation | Tok/s | Est. / 1k calls |
|---|---|---|---|---|---|---|---|
| llama3.1:latest | 757 ms | 895 ms | 771 ms | ~33 ms | ~658 ms | 53 | ~13 min |
| mistral:latest | 1,023 ms | 1,465 ms | 1,082 ms | ~37 ms | ~965 ms | 55 | ~18 min |
To raise throughput, set OLLAMA_NUM_PARALLEL on the server or run a second Ollama instance.
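
The "Est. / 1k calls" column looks consistent with mean latency multiplied by 1,000 sequential calls. The sketch below reproduces that arithmetic. The `parallel` argument is an assumption added for illustration: it models ideal linear scaling across concurrent requests, which a real server is unlikely to reach exactly.

```python
# Mean latencies in milliseconds, copied from the table above.
MEAN_LATENCY_MS = {
    "llama3.1:latest": 771,
    "mistral:latest": 1082,
}


def minutes_per_batch(mean_ms: float, calls: int = 1000, parallel: int = 1) -> float:
    """Estimated wall-clock minutes for `calls` requests.

    `parallel` is a hypothetical knob: it assumes perfectly linear scaling,
    so treat the result as a lower bound when parallel > 1.
    """
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return mean_ms * calls / parallel / 60_000  # 60,000 ms per minute


for model, mean_ms in MEAN_LATENCY_MS.items():
    print(f"{model}: ~{minutes_per_batch(mean_ms):.0f} min per 1k calls")
```

With `parallel=1` this rounds to the 13 and 18 minute figures in the table.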
Correctness
Evaluated against a labeled set of 58 annotated pairs drawn from existing entity_matches — exact, dba, and high-confidence fuzzy positives plus easy and hard synthetic negatives (same-country, shared-token company pairs). An additional 20 unlabeled low-confidence fuzzy pairs were included for observation. Best configuration: country context added to every prompt.
Overall results — best configuration (+ country context)
| Model | Accuracy | Precision | Recall | F1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|
| mistral:latest (best) | 89.7% | 89.3% | 89.3% | 89.3% | 25 | 27 | 3 | 3 |
| llama3.1:latest | 75.9% | 66.7% | 100% | 80.0% | 28 | 16 | 14 | 0 |
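
The percentage columns follow directly from the four confusion counts. As a check, the sketch below recomputes them using the standard definitions of accuracy, precision, recall, and F1. The `Confusion` class name is hypothetical; only the counts come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion counts for a binary match / no-match classifier."""
    tp: int
    tn: int
    fp: int
    fn: int

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in terms of counts.
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


# Counts copied from the table above.
mistral = Confusion(tp=25, tn=27, fp=3, fn=3)
llama = Confusion(tp=28, tn=16, fp=14, fn=0)

for name, c in (("mistral:latest", mistral), ("llama3.1:latest", llama)):
    print(f"{name}: acc={c.accuracy:.1%} p={c.precision:.1%} "
          f"r={c.recall:.1%} f1={c.f1:.1%}")
```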
Per-tier breakdown
| Tier | N | Description | llama3.1 acc | mistral acc |
|---|---|---|---|---|
| exact | 15 | Same normalized name | 100% | 100% |
| dba | 7 | Known aliases, conf ≥ 0.95 | 100% | 57% |
| fuzzy-high | 6 | Fuzzy match, conf ≥ 0.85 | 100% | 100% |
| neg-easy | 15 | Different country, unrelated name | 100% | 100% |
| neg-hard | 15 | Same country, shared token, different entity | 6.7% | 80% |
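
The per-tier accuracies can be reconciled with the overall confusion counts: every error in a positive tier is a false negative, and every error in a negative tier is a false positive. The sketch below works through that bookkeeping. The `TIERS` layout and the `implied_errors` helper are illustrative names, and accuracies are rounded as printed in the table.

```python
from typing import Tuple

# (tier, n, is_positive_tier, llama3.1 accuracy, mistral accuracy), from the table.
TIERS = [
    ("exact", 15, True, 1.00, 1.00),
    ("dba", 7, True, 1.00, 0.57),
    ("fuzzy-high", 6, True, 1.00, 1.00),
    ("neg-easy", 15, False, 1.00, 1.00),
    ("neg-hard", 15, False, 0.067, 0.80),
]


def implied_errors(model_col: int) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) implied by per-tier accuracy.

    model_col is 0 for llama3.1 and 1 for mistral.
    """
    fp = fn = 0
    for _tier, n, is_positive, *accs in TIERS:
        wrong = n - round(n * accs[model_col])  # misclassified pairs in this tier
        if is_positive:
            fn += wrong  # a rejected true match
        else:
            fp += wrong  # an accepted non-match
    return fp, fn


print("llama3.1 (FP, FN):", implied_errors(0))
print("mistral  (FP, FN):", implied_errors(1))
```

The implied counts agree with the FP and FN columns of the overall results table: llama3.1's errors all come from neg-hard, while mistral's split between dba and neg-hard.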
Context improvements
| Configuration | llama3.1 F1 | mistral F1 | Notes |
|---|---|---|---|
| No context | 78.9% | 80.7% | Baseline |
| + Country context (best) | 80.0% | 89.3% | +8.6 pp for mistral; country mismatch note in prompt |
| + Country + Categories | 78.9% | 84.2% | Categories hurt: only 3 coarse domain values |
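
To make the "+ Country context" configuration concrete, here is a minimal sketch of how a prompt might carry the two country fields and a mismatch note. The prompt wording and the `build_prompt` function are assumptions for illustration, not the prompt used in this evaluation; only the field names `source_country` and `company_country` come from this document.

```python
from typing import Optional


def build_prompt(
    source_name: str,
    company_name: str,
    source_country: Optional[str] = None,
    company_country: Optional[str] = None,
) -> str:
    """Assemble a match prompt; country lines are added only when both are known."""
    lines = [
        # Hypothetical instruction text, not the evaluated prompt.
        "Do these two names refer to the same company? "
        "Answer YES or NO, then give a one-sentence reason.",
        f"Name A: {source_name}",
        f"Name B: {company_name}",
    ]
    if source_country and company_country:
        lines.append(f"Country A: {source_country}")
        lines.append(f"Country B: {company_country}")
        # Explicit mismatch note, mirroring the table's "country mismatch note in prompt".
        if source_country.strip().casefold() != company_country.strip().casefold():
            lines.append("Note: the countries differ, which is evidence against a match.")
    return "\n".join(lines)


print(build_prompt("Acme GmbH", "Acme Inc.", "DE", "US"))
```

Calling the function without the country arguments yields the "No context" baseline shape, which keeps the two configurations easy to compare.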
Model Profiles
llama3.1:latest — Aggressive matcher
Recall is 100% with zero false negatives: it missed no true match in the labeled set. However, it misclassifies 14 of the 15 hard negatives as matches. It treats structural similarity ("Shenzhen X Technology Co." ≈ "Shenzhen Y Technology Co.") as evidence of a match, relying on the city prefix and entity-type suffix rather than the actual company name. Country context has minimal effect on same-country pairs.
Best used when downstream review will catch false positives, or when missing a true match is more costly than surfacing a false one.
mistral:latest — Calibrated matcher
F1 is 89.3% with country context, with balanced precision and recall. Its reasoning explicitly cites country mismatches and structural differences when rejecting pairs, which makes its decisions auditable. Country context improved its F1 by 8.6 pp, making it the more responsive of the two models to prompt enrichment. Its 3 false negatives all fall on DBA pairs where the company names are visibly unrelated (holding-company ↔ brand-name style entries); its 3 false positives are hard negatives.
Best used for production scoring where precision matters and results feed directly into the database without manual review.
Recommendation
Include source_country and company_country in context. Omit category context until the dataset has sub-industry granularity beyond the current 3 broad domain values.
Set OLLAMA_NUM_PARALLEL to increase server-side concurrency.
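
Server-side concurrency only pays off if the client also sends requests concurrently. The sketch below shows one way to do that with a thread pool against Ollama's `/api/generate` endpoint on its default local port. The helper names, the worker count, and the timeout are assumptions; if the server's OLLAMA_NUM_PARALLEL is lower than `workers`, the extra requests would be expected to wait in the server's queue rather than run in parallel.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# Default local Ollama endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ollama_generate(prompt: str, model: str = "mistral:latest") -> str:
    """Send one non-streaming generate request and return the response text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read())["response"]


def score_all(
    prompts: Iterable[str],
    send: Callable[[str], str] = ollama_generate,
    workers: int = 4,  # hypothetical value; align it with OLLAMA_NUM_PARALLEL
) -> List[str]:
    """Run `send` over all prompts concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))
```

Passing `send` as a parameter keeps the batching logic testable without a running server, for example `score_all(["a", "b"], send=str.upper)`.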