Tunisian Door

تونسية

Tunisian Arabic Corpus

Using the Corpus

There are three ways to search the corpus: Exact, Lemma, and RegEx.

Exact

An exact search matches the word exactly as you type it. For example, typing يمشي will return all instances of يمشي, but not تمشي or مشيت.

Lemma

A lemma search returns inflected forms of the word. For example, typing مشي will return يمشي, مشيت, etc., and typing دار will return الدار, دارها, والدار, etc.

  • Enter the basic form of the word, without any conjugations or inflections.
  • Note that the software that performs the lemmatization is not very accurate, so you may want to try several different searches.
  • Also, the lemmatization does not account for changes in the form of the word itself. For example:
    • To find all instances of the verb مشى, you will need to search for both مشي and مشى (as well as spelling variants like مشا).
    • To find all instances of كرهبة, you will need to search for كرهبة and كرهبت (to find words like كرهبتو).
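
The two cases above can be turned into a list of search terms to try one by one. The following is only a sketch: `lemma_variants` is a hypothetical helper, not part of the corpus software, and its substitution rules cover only the final ي / ى / ا and final ة / ت alternations mentioned above.

```python
# Hypothetical helper: builds the spelling variants worth trying in a
# Lemma search. The two groups below are assumptions taken from the
# examples above, not rules used by the corpus itself.
FINAL_LETTER_GROUPS = ("يىا", "ةت")

def lemma_variants(word):
    """Return the word plus variants that differ only in the final letter."""
    variants = [word]
    for group in FINAL_LETTER_GROUPS:
        if word and word[-1] in group:
            for letter in group:
                candidate = word[:-1] + letter
                if candidate not in variants:
                    variants.append(candidate)
    return variants

for term in ("مشي", "كرهبة", "دار"):
    print(term, lemma_variants(term))
```

Each variant in the returned list would then be entered as a separate Lemma search. A word such as دار, whose final letter is in neither group, comes back unchanged.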

RegEx

Searching with regular expressions allows you to perform sophisticated wildcard searches.

  • To perform a regex search, it's best to enter your search in transliteration rather than in Arabic characters, following the transliteration system given below. While you can enter searches in Arabic script (like في.*), complex searches become difficult to type and read because the regex operators are written left to right (LTR) while the Arabic script runs right to left (RTL).
  • Do not use the word boundary character (\b) in your search. Word boundaries are added to the search string automatically.
  • A good primer on regex syntax can be found at the RegEx page on Wikipedia. Some examples:
    • [sS]Hb will return صحب or سحب
    • krhb[pht](\w)* will return كرهبة, كرهبه, كرهبتي, etc.
    • (ma)?\w{3,6}J will return مشيتش, يحبوش, مافيباليش, and most other negative verbs and pseudo-verbs. (Currently there is no way to search for multi-word phrases like ما عرفش.)
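
To see how these patterns behave, they can be tried offline against a few transliterated words. This is a minimal sketch that uses Python's `re` module as a stand-in; the regex engine the corpus actually uses may differ in details. The `search` helper and the sample text are made up for illustration, and the helper wraps each pattern in `\b` on the assumption that this mirrors the automatic word boundaries described above.

```python
import re

def search(pattern, text):
    """Return every whole-word match of pattern in transliterated text."""
    # Assumption: the corpus adds \b on both sides of the query.
    wrapped = re.compile(r"\b(?:" + pattern + r")\b")
    return [match.group(0) for match in wrapped.finditer(text)]

# Made-up sample of transliterated words, separated by spaces.
sample = "sHb SHb krhbp krhbty mJytJ yHbwJ mafybalyJ ktb"

print(search(r"[sS]Hb", sample))          # ['sHb', 'SHb']
print(search(r"krhb[pht](\w)*", sample))  # ['krhbp', 'krhbty']
print(search(r"(ma)?\w{3,6}J", sample))   # ['mJytJ', 'yHbwJ', 'mafybalyJ']
```

Because the boundaries are added around the whole pattern, a word such as ktb is skipped by all three searches instead of matching as part of a longer string.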

Transliteration System

Modified version of the Buckwalter system:

Character  Arabic Letter  Description
c          ء              hamza-on-the-line
A          آ              madda
e          أ              hamza-on-'alif
W          ؤ              hamza-on-waaw
I          إ              hamza-under-'alif
i          ئ              hamza-on-yaa'
a          ا              bare 'alif
b          ب              baa'
p          ة              taa' marbuuTa
t          ت              taa'
v          ث              thaa'
j          ج              jiim
H          ح              Haa'
x          خ              khaa'
d          د              daal
V          ذ              dhaal
r          ر              raa'
z          ز              zaay
s          س              siin
J          ش              shiin
S          ص              SaaD
D          ض              DaaD
T          ط              Taa'
Z          ظ              Zaa' (DHaa')
C          ع              cayn
G          غ              ghayn
f          ف              faa'
q          ق              qaaf
g          ڤ              gaaf
k          ك              kaaf
l          ل              laam
m          م              miim
n          ن              nuun
h          ه              haa'
w          و              waaw
E          ى              'alif maqSuura
y          ي              yaa'
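
For checking a transliterated search term against its Arabic spelling, the table can be written as a lookup. This is a sketch only: the names `TRANSLIT`, `to_arabic`, and `to_translit` are hypothetical and not part of the corpus software, and characters outside the table are passed through unchanged.

```python
# Latin-to-Arabic pairs copied from the transliteration table above.
TRANSLIT = {
    "c": "ء", "A": "آ", "e": "أ", "W": "ؤ", "I": "إ", "i": "ئ",
    "a": "ا", "b": "ب", "p": "ة", "t": "ت", "v": "ث", "j": "ج",
    "H": "ح", "x": "خ", "d": "د", "V": "ذ", "r": "ر", "z": "ز",
    "s": "س", "J": "ش", "S": "ص", "D": "ض", "T": "ط", "Z": "ظ",
    "C": "ع", "G": "غ", "f": "ف", "q": "ق", "g": "ڤ", "k": "ك",
    "l": "ل", "m": "م", "n": "ن", "h": "ه", "w": "و", "E": "ى",
    "y": "ي",
}

# Reverse lookup, built from the same pairs.
ARABIC_TO_LATIN = {arabic: latin for latin, arabic in TRANSLIT.items()}

def to_arabic(text):
    """Convert a transliterated word to Arabic script."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)

def to_translit(text):
    """Convert an Arabic-script word to transliteration."""
    return "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in text)

print(to_arabic("krhbp"))   # كرهبة
print(to_translit("يمشي"))  # ymJy
```

These helpers are meant for plain words. Running `to_arabic` on a full regex pattern would also convert letters that belong to operators, such as the w in \w.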