Tunisiya

تونسية

Tunisian Arabic Corpus

About the Project

The Tunisiya project is a one-million-word Tunisian Arabic corpus created by Karen McNeil and Miled Faiza. It focuses on the prestige variety of Tunisian Arabic spoken in Tunis, with occasional representation of regional differences. A key strength of the corpus is its inclusion of digitized print materials, such as novels, which provide a rich resource for studying written Tunisian Arabic. This corpus supports a wide range of linguistic research and practical applications, offering an in-depth look at the language's structure and usage. The corpus is freely available for research and educational purposes, and we encourage researchers to explore its contents and contribute to the field of Arabic dialectology.

The corpus currently contains 1,189,303 words in 12 categories.

Category Breakdown

Updates

New AI-assisted transcription and correction

July 12, 2025

When we had the Tunisian novels transcribed on Mechanical Turk a few years ago, there were a lot of mistakes in the transcripts. Some were so bad (and I didn't have time to correct them manually) that I never added them to the corpus at all.

I've been working on an AI-assisted workflow for correcting the transcripts; I've just finished it and run three novels through it (a rough sketch of the correction step is at the end of this post). It's been a very interesting and successful project:

  • The original transcripts had a word error rate (WER) of about 7-8% for the normal books (مذكرات الشابي بالدارجة and ومن الحب ما فشل). After processing them through the AI system (using GPT-4o), the error rate decreased to about 2%.
  • The most challenging book, الامير الصغرون, had an original WER of over 23% (!), so almost one in every four words was written incorrectly. It was so difficult for the transcribers because it has a hard-to-read font and is fully voweled, so the (non-Tunisian) transcribers often made mistakes like misreading a ت with a sakoun above it as a ث. It was harder for the AI as well: I got the best results by using a more advanced model, GPT-4.5-preview. (This model is much more expensive, $75 per million tokens versus $2.50 per million tokens for 4o, but the books are small enough that the total cost was reasonable.) After some clever prompt engineering and testing of different models, the final AI-corrected error rate was 2.16%, about one-tenth of the original!
  • Here's a chart showing the original and corrected word error rate (WER) for each book: "WER Comparison between Original and Corrected Transcripts"
  • There are some other transcripts that were never added to the corpus; I'm going to run them through the corrector and add them. I'm also going to correct some books that were already added but that I know have mistakes (like أسرار عائلية).

We're in Tunisia now, so I'm also going to search out new books that have been published in derja over the last couple of years. Faten Fazaa, mashallah, has kept up her pace of a new novel every year, and there are others as well. Whatever I find, I'll have scanned, and then I'll build a similar AI system to transcribe the novels from the scans. We'll see whether that is as accurate as the corrected transcripts; as long as the error rate is below 3%, it should be good enough for most purposes.
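For anyone curious what the correction step looks like in practice, here is a minimal sketch, assuming the OpenAI Python client; the prompt, chunking, and helper names are illustrative rather than the exact pipeline used for the corpus. The WER function is the standard word-level edit distance that the percentages above refer to.

```python
# Sketch of an LLM correction pass plus WER scoring (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CORRECTION_PROMPT = (
    "You are correcting a typed transcript of a Tunisian Arabic (derja) novel. "
    "Fix misread letters and typos, but do not rephrase, translate, or add vowels. "
    "Return only the corrected text."
)

def correct_chunk(chunk: str, model: str = "gpt-4o") -> str:
    """Send one page-sized chunk of transcript to the model for correction."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CORRECTION_PROMPT},
            {"role": "user", "content": chunk},
        ],
        temperature=0,  # we want faithful correction, not creative rewriting
    )
    return response.choices[0].message.content.strip()

def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by the number of reference words.
    Returns a fraction; multiply by 100 for percentages like those above."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution
            )
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)
```

Scoring a book is then a matter of comparing the raw and the corrected transcripts against a hand-checked reference passage with word_error_rate(), which is how figures like the ones quoted above can be produced.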

New Site Launched!

Jan. 9, 2025

As you can see, the new version of the site is finally live. Our apologies that it's taken so long (and that we had several weeks of downtime). What can I say, we're busy people. :-D

We hope that everyone enjoys the more modern look of the site and the better performance. Some of the improvements you might notice are:

  • Search results are returned much faster
  • There are pagination controls on the search results page to break up large result sets
  • Metadata about the texts (category, title, author, date) is displayed on the search results page
  • Each concordance line includes an option to see more context (200 words)
  • Searches can be limited by category, and multiple categories can be selected (a rough sketch of the backend side of this is just below)
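Since the rebuilt site runs on Django (see the rebuild notes further down the page), the pagination and category filtering amount to something like the following sketch. The model and field names (Text, body, category) are hypothetical; only the Django APIs (Paginator, queryset filtering) are real.

```python
# Hypothetical Django view showing paginated, category-filtered search results.
# Model and field names (Text, body, category, title) are illustrative only.
from django.core.paginator import Paginator
from django.shortcuts import render

from .models import Text  # hypothetical corpus text model

def search(request):
    query = request.GET.get("q", "")
    categories = request.GET.getlist("category")  # multiple categories allowed

    results = Text.objects.filter(body__icontains=query)
    if categories:
        results = results.filter(category__in=categories)

    # 50 concordance lines per page; ?page=N selects the page.
    paginator = Paginator(results.order_by("title"), 50)
    page = paginator.get_page(request.GET.get("page"))

    return render(request, "search_results.html", {
        "page": page,
        "query": query,
        "categories": categories,
    })
```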

In addition, there have been many improvements on the backend that make the application faster and more robust. Upcoming features that did not make it into this first release but that we're hoping to add soon include:

  • A progress bar/spinner to show when a search is still being processed. Even though searches are faster, a search for a common word with thousands of results will still take a moment to complete. Right now there's no progress indicator, and nothing to keep the user from hitting the "Search" button multiple times (since it looks like nothing is happening). We want to fix that.
  • Entire-corpus request form. We want to give computational linguists a way to request a full version of the corpus for their research (since we are bad at responding to such requests over email).
  • Search result sorting, giving the user the ability to sort search results by the word before or the word after the keyword.
  • Improvements in the quality of the texts. The materials that were most recently added (the novels and short stories) included many typos and other errors. We plan to hire people to fix them.
  • An improved parser/lemmatizer. The accuracy of the current software that determines the uninflected "lemma" of a word is not very high. Generative AI is getting pretty good at doing that, even for languages like Tunisian Arabic, so I'd like to experiment with building an LLM-based parser.
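As a first experiment, an LLM-based lemmatizer could be as simple as the sketch below, again assuming the OpenAI Python client; the prompt, the JSON fields, and the example word are illustrative only.

```python
# Sketch of an LLM-based lemmatizer for Tunisian Arabic (illustrative only).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

LEMMATIZER_PROMPT = (
    "You are a morphological analyzer for Tunisian Arabic (derja). "
    "For the given word, return JSON with the fields 'lemma' (the uninflected "
    "dictionary form), 'pos' (part of speech), and 'features' (a short list of "
    "inflectional features such as person, number, and attached clitics)."
)

def lemmatize(word: str, model: str = "gpt-4o") -> dict:
    """Ask the model for the lemma and morphological analysis of one word."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": LEMMATIZER_PROMPT},
            {"role": "user", "content": word},
        ],
        temperature=0,
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(response.choices[0].message.content)

# Example: lemmatize("يكتبولي") might return something like
# {"lemma": "كتب", "pos": "verb",
#  "features": ["3rd person", "plural", "indirect object clitic"]}
```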

What do you think about the new site, and what other features would you like to see? Please let us know!

Site Under Maintenance

Dec. 20, 2024

Apologies to those who have visited the site recently and been greeted with an error message. The site was built on (what is now) a very old version of Python that is no longer supported by our web host. We're currently rebuilding the site with a newer version, and it will be new and improved. Sorry for the inconvenience!

New look and functionality coming soon

Jan. 19, 2024

The software that the current website runs on is now somewhat old, and my web host is going to stop supporting it at the end of February. So I'm working on rebuilding the corpus website with the newest versions of Python, Django, etc. I will be launching it in about a month (~mid-February). I expect there to be minimal downtime when I make the switch to the new site. The look will be different (sleeker, faster, and more modern), but the site capabilities will be mostly the same. I will be implementing a new and improved version of the parser, however, so results of "stem" searches may be slightly different. If you're currently working with a results set, you might want to download it before the change.

Novels are in the process of being added!

Dec. 30, 2021

It's taken longer than we hoped, but we have been slowly adding the Tunisian novels to the corpus. And, thanks to a small grant from the Georgetown University Graduate School, we are getting all the rest of them transcribed during this winter break. Some lovely folks on Amazon Mechanical Turk, as well as some students in Tunisia, are working on the transcriptions now; as soon as they're done I will QC them and get them loaded onto the corpus site. We have eight novels, four children's books, and three translations that will be added to the "Literature" category.