Supported dialects
| Dialect | Code | Regions | Key markers |
|---|---|---|---|
| Gulf | gulf | Saudi Arabia, UAE, Kuwait, Qatar, Bahrain, Oman | شلون، وايد، زين، حلو، يالله |
| Egyptian | egyptian | Egypt | أوي، ده، دي، عايز، ازاي، كده |
| Levantine | levantine | Lebanon, Syria, Jordan, Palestine | هلق، كتير، شو، هيك، منيح |
| MSA | msa | Pan-Arab (formal) | إن، الذي، بالإضافة، يجب، لذلك |
Dialect detection accuracy: 100% on a 100-comment test corpus spanning all four dialects (Gulf, Egyptian, Levantine, MSA), measured during the Phase 0.5 evaluation.
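To make the "Key markers" column concrete, here is a minimal sketch that counts how many of the table's markers appear in a comment. It is an illustration only: NAWA's actual classification is performed by ALLaM, not by keyword lookup, and the names `KEY_MARKERS`, `ARABIC_WORD`, and `count_markers` are hypothetical.

```python
import re
from collections import Counter

# Key markers copied from the table above, keyed by dialect code.
KEY_MARKERS = {
    "gulf": {"شلون", "وايد", "زين", "حلو", "يالله"},
    "egyptian": {"أوي", "ده", "دي", "عايز", "ازاي", "كده"},
    "levantine": {"هلق", "كتير", "شو", "هيك", "منيح"},
    "msa": {"إن", "الذي", "بالإضافة", "يجب", "لذلك"},
}

# Runs of Arabic letters (U+0621..U+064A). Punctuation such as ؟ and ، is
# dropped. Assumption: the input carries no diacritics, which would split words.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def count_markers(comment: str) -> Counter:
    """Count exact-word matches against each dialect's key markers."""
    words = ARABIC_WORD.findall(comment)
    counts = Counter()
    for code, markers in KEY_MARKERS.items():
        counts[code] = sum(1 for word in words if word in markers)
    return counts


if __name__ == "__main__":
    sample = "الفيديو ده جامد أوي، عايز تاني كده"
    print(count_markers(sample).most_common(1))
```

Exact word matching misses inflected or prefixed forms, which is one reason a model-based classifier is preferable to a lookup like this.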
Example classifications
Gulf Arabic
Comment: “وش رايكم بالمحتوى الجديد؟ أنا شايف إنه وايد حلو”
Translation: “What do you think of the new content? I think it’s very nice”
Markers: وش (what - Gulf), وايد (very - Gulf), حلو (nice - Gulf)
Egyptian Arabic
Comment: “الفيديو ده جامد أوي، عايز تاني كده”
Translation: “This video is amazing, I want more like this”
Markers: ده (this - Egyptian), أوي (very - Egyptian), عايز (I want - Egyptian), كده (like this - Egyptian)
Levantine Arabic
Comment: “كتير حلو الفيديو، بس شو القصة ورا الأغنية؟”
Translation: “Very nice video, but what’s the story behind the song?”
Markers: كتير (very - Levantine), شو (what - Levantine)
Modern Standard Arabic (MSA)
Comment: “يجب أن نشجع هذا النوع من المحتوى الهادف”
Translation: “We should encourage this type of meaningful content”
Markers: يجب أن (must - MSA formal), هذا النوع من (this type of - MSA structure)
How it works
NAGL + ALLaM pipeline
NAWA uses a two-stage pipeline called NAGL (NAWA Augmented Generation Layer):
- Language detection: Identifies the input language. Arabic text is routed to the ALLaM model via HUMAIN’s API.
- Dialect classification: ALLaM analyzes morphological patterns, vocabulary, and syntactic structures to determine the dialect.
- Confidence scoring: A calibrated confidence score (0–1) indicates how certain the model is about the dialect classification.
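The flow listed above can be sketched roughly as follows. Everything here is an assumption made for illustration: the Unicode-ratio language check, the `classify` callable standing in for the ALLaM request (the real HUMAIN API is not shown in this document), and the `dialect` key in the result. Only `dialect_confidence` and the dialect codes come from this page.

```python
import re
from typing import Callable, Optional, Tuple

ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")

# Hypothetical stand-in for the ALLaM call: text -> (dialect code, raw score).
Classifier = Callable[[str], Tuple[str, float]]


def looks_arabic(text: str, threshold: float = 0.5) -> bool:
    """Assumed heuristic: at least `threshold` of the letters are Arabic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for ch in letters if ARABIC_LETTER.match(ch))
    return arabic / len(letters) >= threshold


def analyze(text: str, classify: Classifier) -> Optional[dict]:
    # Language detection: only Arabic text is routed to the classifier.
    if not looks_arabic(text):
        return None
    # Dialect classification: delegated to the caller-supplied function.
    dialect, score = classify(text)
    # Confidence scoring: this sketch only clamps the score to [0, 1];
    # real calibration is outside its scope.
    confidence = min(1.0, max(0.0, score))
    return {"dialect": dialect, "dialect_confidence": confidence}
```

Passing the classifier in as a parameter keeps the sketch runnable with a stub, for example `analyze("وايد حلو", lambda text: ("gulf", 0.93))`.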
ALLaM, originally developed by SDAIA and now offered by HUMAIN, is the most advanced large language model purpose-built for Arabic. NAWA is an official HUMAIN partner.
The dialect_confidence field
The dialect_confidence score ranges from 0 to 1:
| Range | Interpretation |
|---|---|
| 0.90–1.00 | High confidence - strong dialectal markers present |
| 0.70–0.89 | Medium confidence - some dialectal features detected |
| 0.50–0.69 | Low confidence - text may be code-switched or ambiguous |
| < 0.50 | Very low - text may be too short or use minimal dialectal features |
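A client might map the score to these bands with a small helper. This is a sketch, not part of any NAWA SDK: the function name `interpret_confidence` and the returned labels are hypothetical, and the bands are treated as half-open intervals so that a value such as 0.895 falls into the medium band.

```python
def interpret_confidence(score: float) -> str:
    """Map a dialect_confidence score to the band names in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"dialect_confidence must be within [0, 1], got {score}")
    if score >= 0.90:
        return "high"
    if score >= 0.70:
        return "medium"
    if score >= 0.50:
        return "low"
    return "very low"
```

A caller could, for instance, accept only results rated `"high"` or `"medium"` and send the rest for manual review. The right cut-off depends on the application.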