تونسية

Tunisian Arabic Corpus

About the Project

The Tunisiya project is a one-million-word Tunisian Arabic corpus created by Karen McNeil and Miled Faiza. It focuses on the prestige variety of Tunisian Arabic spoken in Tunis, with occasional representation of regional differences. A key strength of the corpus is its inclusion of digitized print materials, such as novels, which provide a rich resource for studying written Tunisian Arabic. This corpus supports a wide range of linguistic research and practical applications, offering an in-depth look at the language's structure and usage. The corpus is freely available for research and educational purposes, and we encourage researchers to explore its contents and contribute to the field of Arabic dialectology.

The corpus currently contains 1,148,405 words in 12 categories.

Category Breakdown

Updates

New Site Launched!

Jan. 9, 2025

As you can see, the new version of the site is finally live. Our apologies that it's taken so long (and that we had several weeks of downtime). What can I say, we're busy people. :-D

We hope that everyone enjoys the more modern look of the site and the better performance. Some of the improvements you might notice are:

  • Search results are returned a lot faster
  • There are pagination controls on the search results page to break up numerous results
  • Metadata about the texts (category, title, author, date) is displayed in the search results page
  • The concordance line includes an option to see more context (200 words)
  • Searches can be limited by category, and multiple categories can be selected
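
For readers curious about what a concordance search like this involves, here is a minimal Python sketch. It is not the site's actual code: the `Text` record and the `concordance` and `paginate` functions are hypothetical names, tokenization is assumed to be plain whitespace splitting, and the default context size and page size are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Text:
    """Hypothetical record for one corpus text and its metadata."""
    title: str
    category: str
    body: str


def concordance(texts, query, categories=None, context=5):
    """Return (title, category, left, keyword, right) for every exact match.

    `categories` is an optional collection of category names; when given,
    texts outside those categories are skipped. `context` is the number of
    words kept on each side of the keyword.
    """
    hits = []
    for text in texts:
        if categories and text.category not in categories:
            continue
        tokens = text.body.split()  # assumption: whitespace tokenization
        for i, token in enumerate(tokens):
            if token == query:
                left = " ".join(tokens[max(0, i - context):i])
                right = " ".join(tokens[i + 1:i + 1 + context])
                hits.append((text.title, text.category, left, token, right))
    return hits


def paginate(hits, page, per_page=20):
    """Return one page of results; pages are numbered from 1."""
    start = (page - 1) * per_page
    return hits[start:start + per_page]
```

Calling `concordance` with a larger `context` value (for example 200) would give the wider view offered by the "more context" option, and passing several names in `categories` corresponds to selecting multiple categories.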

In addition, there have been many improvements on the backend that make the application faster and more robust. Upcoming features that did not make it into this first release but that we're hoping to add soon include:

  • A progress bar/spinner to show when a search is still being processed. Even though the searches are faster, if you search for a common word that has thousands of results, it will take a moment to complete. Right now there's no indicator of progress, or anything to keep the user from hitting the "Search" button multiple times (since it looks like it's not doing anything). We want to fix that.
  • Entire corpus request form. We want to give computational linguists a way to request a full version of the corpus for their research. (Since we are bad at responding to such requests over email.)
  • Search results sorting. We want to give users the ability to sort search results by the word before or the word after the search term.
  • Improvements in the quality of the texts. The materials that were most recently added (the novels and short stories) included many typos and other errors. We plan to hire people to fix them.
  • An improved parser/lemmatizer. The accuracy of the current software that determines the uninflected "lemma" of a word is not very high. Generative AI is getting pretty good at doing that, even for languages like Tunisian Arabic, so I'd like to experiment with building an LLM-based parser.
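
To give an idea of what the planned sorting by word before and word after could look like, here is a minimal Python sketch. It is an illustration under assumptions, not the planned implementation: `sort_hits` is a hypothetical function, each concordance line is assumed to be a `(left, keyword, right)` tuple of strings, and ordering is by plain Unicode code point, which only approximates alphabetical order for Arabic script.

```python
def sort_hits(hits, by="after"):
    """Sort concordance lines by the word next to the keyword.

    `hits` is a list of (left, keyword, right) tuples (an assumed format).
    by="before" sorts on the last word of the left context;
    by="after" sorts on the first word of the right context.
    Lines with no neighbouring word sort first.
    """
    if by not in ("before", "after"):
        raise ValueError("by must be 'before' or 'after'")

    def neighbour(hit):
        left, _keyword, right = hit
        if by == "before":
            words = left.split()
            return words[-1] if words else ""
        words = right.split()
        return words[0] if words else ""

    # sorted() is stable, so lines with the same neighbour keep their order.
    return sorted(hits, key=neighbour)
```

Because the sort is stable, sorting first by word after and then by word before would group lines by the preceding word while keeping the following-word order within each group.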

What do you think about the new site, and what other features would you like to see? Please let us know!

Site Under Maintenance

Dec. 20, 2024

Apologies to those who have visited the site recently and been greeted with an error message. The site was built using (what is now) a very old version of Python, and it is no longer supported by our web host. We're currently working on rebuilding the site with a newer version, and it will be new and improved. Sorry for the inconvenience!

New look and functionality coming soon

Jan. 19, 2024

The software that the current website runs on is now somewhat old, and my web server is going to stop supporting it at the end of February. So I'm working on rebuilding the corpus website with the newest versions of Python, Django, etc. I will be launching it in about a month (~mid-February). I expect there to be minimal downtime when I make the switch to the new site. The look will be different (sleeker, faster, and more modern), but the site capabilities will be mostly the same. I will be implementing a new and improved version of the parser, however, so results of "stem" searches may be slightly different. If you're currently working with a results set, you might want to download it before the change.

Novels are in the process of being added!

Dec. 30, 2021

It's taken longer than we hoped, but we have been slowly adding the Tunisian novels to the corpus. And, thanks to a small grant from the Georgetown University Graduate School, we are getting all the rest of them transcribed during this winter break. Some lovely folks on Amazon Mechanical Turk, as well as some students in Tunisia, are working on the transcriptions now; as soon as they're done I will QC them and get them loaded onto the corpus site. We have eight novels, four children's books, and three translations that will be added to the "Literature" category.

New Material Soon to Be Added

Dec. 16, 2019

Over the summer, we collected twelve full-length books written in Tunisian Arabic: eight novels, three translations and some short story collections. We are currently seeking funding to digitize them, and will post another update once that is done and they are added to the corpus.