Model Selection
Model selection is about balancing accuracy, latency, cost, and real-world performance rather than optimizing a single metric.
Model Size
SLM vs LLM
| Type | Description |
|---|---|
| SLM (Small Language Model) | Smaller models optimized for specific tasks, lower latency, and reduced compute requirements |
| LLM (Large Language Model) | Large general-purpose models capable of handling multiple tasks and broad reasoning |
Typical Tradeoff
| Feature | SLM | LLM |
|---|---|---|
| Compute Cost | Lower | Higher |
| Latency | Lower | Higher |
| Generalization | Limited | Strong |
| Domain Specialization | Strong | Moderate |
| Memory Usage | Lower | Higher |
1. Model Accuracy
How often a model predicts correctly on unseen data.
Example:
95 correct predictions out of 100
→ Accuracy = 95%
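
The calculation above can be sketched in a few lines of Python. The label and prediction lists are made-up data chosen to reproduce the 95-out-of-100 case, and the function name `accuracy` is only illustrative.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical data: 100 labels, with the first 5 predictions wrong.
labels = [1] * 100
predictions = [0] * 5 + [1] * 95

print(accuracy(predictions, labels))  # 0.95
```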
2. BLEU: Bilingual Evaluation Understudy Score
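Measures the precision of n-gram overlap between generated text and reference text.
Simplified BLEU Formula
$$
\text{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
$$
Where:
- $BP$ = brevity penalty
- $p_n$ = n-gram precision
- $w_n$ = weights (typically $w_n = 1/N$)
The brevity penalty compares the candidate length $c$ with the reference length $r$:
$$
BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}
$$
So long candidates are not penalized; only candidates shorter than the reference are.
Higher overlap → higher BLEU score.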
Used mainly for:
- machine translation
- text generation evaluation
Example:
| Reference | "The cat sits on the mat" |
|---|---|
| Generated | "The cat is on the mat" |
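
As a rough illustration, the sentence pair above can be scored with a small from-scratch sketch of the standard BLEU definition (brevity penalty times the geometric mean of clipped n-gram precisions). The helper names are hypothetical, tokens are simply lowercased and split on whitespace, and there is a single reference with no smoothing, so this is not a substitute for a library implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0


def brevity_penalty(candidate_len, reference_len):
    """1 for candidates longer than the reference, exp(1 - r/c) otherwise."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU with uniform weights 1/max_n and one reference."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_avg)


reference = "The cat sits on the mat".lower().split()
generated = "The cat is on the mat".lower().split()

print(f"1-gram precision: {modified_precision(generated, reference, 1):.3f}")  # 0.833
print(f"2-gram precision: {modified_precision(generated, reference, 2):.3f}")  # 0.600
print(f"BLEU-2: {bleu(generated, reference, max_n=2):.3f}")  # 0.707
print(f"BLEU-4: {bleu(generated, reference, max_n=4):.3f}")  # 0.000
```

BLEU-4 comes out as zero here because the two sentences share no 4-gram. This is why short single sentences are usually scored with smoothing or at corpus level.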
3. ROUGE Score
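Measures how much of the reference content the generated text captures.
ROUGE stands for:
Recall-Oriented Understudy for Gisting Evaluation
Simplified ROUGE Formula
$$
\text{ROUGE-N} = \frac{\text{number of overlapping n-grams}}{\text{total number of n-grams in the reference}}
$$
Higher scores indicate higher similarity between the automatically produced summary and the reference.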
Focus:
- recall
- content coverage
Used mainly for:
- text summarization
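
A minimal sketch of ROUGE-N recall, reusing the sentence pair from the BLEU example so the two metrics can be compared. The function names are hypothetical, and tokenization is plain whitespace splitting. Real ROUGE tooling also reports precision and F-measure and often applies stemming.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(generated, reference, n=1):
    """Share of reference n-grams that also appear in the generated text."""
    gen_counts = ngram_counts(generated, n)
    ref_counts = ngram_counts(reference, n)
    total = sum(ref_counts.values())
    if total == 0:
        raise ValueError("reference has no n-grams of this size")
    # Clip each match so a reference n-gram is not credited more often
    # than it occurs in the generated text.
    overlap = sum(min(count, gen_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


reference = "the cat sits on the mat".split()
generated = "the cat is on the mat".split()

print(f"ROUGE-1 recall: {rouge_n_recall(generated, reference, n=1):.3f}")  # 0.833
print(f"ROUGE-2 recall: {rouge_n_recall(generated, reference, n=2):.3f}")  # 0.600
```

The denominator is the reference length, not the candidate length, which is what makes this a recall-oriented measure.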
BLEU vs ROUGE
| Metric | Focus | Common Use |
|---|---|---|
| BLEU | Precision | Translation |
| ROUGE | Recall | Summarization |
4. Cosine Similarity
Measures semantic similarity between vector embeddings.
It compares the angle between the vectors, not their magnitudes.
Range:
| Value | Meaning |
|---|---|
| 1 | Very similar |
| 0 | Unrelated |
| -1 | Opposite direction |
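
The three values in the table can be reproduced with a short sketch of the usual definition, cos(θ) = (a · b) / (‖a‖ ‖b‖). The two-dimensional vectors below are toy values chosen for easy arithmetic, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed for illustration only).
print(cosine_similarity([3, 4], [6, 8]))   # same direction: 1.0
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal: 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite direction: -1.0
```

Note that `[3, 4]` and `[6, 8]` differ in length but still score 1.0, because only the direction matters.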
Embedding Similarity Example
flowchart TD
A["'Fast GPU computing'"]
--> C["Embedding Space"]
B["'Parallel GPU processing'"]
--> C
C --> D["High Cosine Similarity"]
5. Cross-Validation
Cross-validation evaluates models using multiple data splits.
Purpose:
- estimate generalization performance
- detect overfitting that a single train/validation split could hide
6. K-Fold Cross Validation
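The dataset is split into K roughly equal folds. The model is trained K times, each time training on K-1 folds and validating on the remaining one, so each fold becomes the validation set exactly once. The K validation scores are then averaged.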
Training strategy:
flowchart LR
A["Fold 1"]
B["Fold 2"]
C["Fold 3"]
D["Fold 4"]
E["Fold 5"]
F["Train on 4 folds<br/>Validate on 1 fold"]
A --> F
B --> F
C --> F
D --> F
E --> F
Benefits:
- better performance estimation
- improved robustness
- less dependence on one particular train/validation split
Useful when:
- datasets are small
- evaluation data is limited
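
The fold rotation can be sketched without any library. The names `k_fold_indices` and `cross_validate` are hypothetical, and `train_and_score` stands in for whatever training and scoring routine is being evaluated. The sketch also assumes the data has already been shuffled, which is usually done before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# 10 samples, 5 folds: each sample is used for validation exactly once.
for fold, (train, validation) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Fold {fold}: validate on {validation}, train on {len(train)} samples")
```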
7. A/B Testing
A/B testing compares two model versions using real users.
Purpose:
- measure production performance
- validate improvements safely
A/B Testing Workflow
flowchart TD
A["Users"]
--> B["Traffic Split"]
B --> C["Model A"]
B --> D["Model B"]
C --> E["Metrics Collection"]
D --> E
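
One common way to implement the traffic split is to hash a stable user identifier, so the same user always sees the same model. The sketch below assumes that approach. The experiment name, user ID format, and function name are all hypothetical.

```python
import hashlib
from collections import Counter


def assign_variant(user_id, experiment="model-ab-test", treatment_share=0.5):
    """Deterministically map a user to model "A" or "B".

    Hashing the experiment name together with the user ID gives a stable
    bucket in [0, 1), so the same user always gets the same variant.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits scaled to [0, 1)
    return "B" if bucket < treatment_share else "A"


# Hypothetical user IDs: the split should land close to 50/50.
counts = Counter(assign_variant(f"user-{i}") for i in range(10_000))
print(counts)
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user who gets Model B in one test is not automatically in the B group of the next.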
Common A/B Testing Metrics
| Metric | Example |
|---|---|
| Click-through rate | Recommendation systems |
| Latency | AI inference |
| User satisfaction | Chatbots |
| Conversion rate | AI assistants |
| Engagement | Content generation |
Offline vs Online Evaluation
| Type | Description |
|---|---|
| Offline Evaluation | Uses held-out datasets and automatic metrics such as accuracy, BLEU, and ROUGE |
| Online Evaluation | Uses real user traffic, for example through A/B testing |
Simplified Mental Model
| Concept | Purpose |
|---|---|
| Accuracy | Correct predictions |
| BLEU | Translation quality |
| ROUGE | Summarization quality |
| Cosine Similarity | Semantic similarity |
| Cross-validation | Reliable evaluation |
| A/B Testing | Real-world comparison |