Tunisiya

تونسية

Tunisian Arabic Corpus

About the Project

The Tunisiya project is a one-million-word Tunisian Arabic corpus created by Karen McNeil and Miled Faiza. It focuses on the prestige variety of Tunisian Arabic spoken in Tunis, with occasional representation of regional differences. A key strength of the corpus is its inclusion of digitized print materials, such as novels, which provide a rich resource for studying written Tunisian Arabic. This corpus supports a wide range of linguistic research and practical applications, offering an in-depth look at the language's structure and usage. The corpus is freely available for research and educational purposes, and we encourage researchers to explore its contents and contribute to the field of Arabic dialectology.

The corpus currently contains 1,189,303 words in 12 categories.

Category Breakdown

Updates

New AI-assisted transcription and correction

July 12, 2025

When we had the Tunisian novels transcribed on Mechanical Turk a few years ago, there were a lot of mistakes in the transcripts. Some were so bad (and I didn't have time to correct them manually) that I never added them to the corpus at all.

I've been working on an AI-assisted workflow for correcting the transcripts; I've just finished it and run three novels through it (a rough sketch of the correction step is at the end of this post). It's been a very interesting and successful project:

  • The original transcripts had a word error rate (WER) of about 7-8% for the normal books (مذكرات الشابي بالدارجة and ومن الحب ما فشل). After processing them through the AI system (using GPT-4o), the error rate decreased to about 2%.
  • The most challenging book, الامير الصغرون, had an original WER of over 23% (!), so almost one in every four words was written incorrectly. It was so difficult for the transcribers because it has a hard-to-read font and is fully voweled, so the (non-Tunisian) transcribers often made mistakes like misreading a ت with a sakoun above it as a ث. It was harder for the AI as well: I got the best results by using a more advanced model, GPT-4.5-preview. (This model is much more expensive, $75 per million tokens versus $2.50 per million tokens for 4o, but the books are small enough that the total cost was reasonable.) After some clever prompt engineering and testing of different models, the final AI-corrected error rate was 2.16%, about one-tenth of the original!
  • Here's a chart showing the original and corrected word error rate (WER) for each book: "WER Comparison between Original and Corrected Transcripts"
  • There are some other transcripts that were never added to the corpus; I'm going to run them through the corrector and add them. I'm also going to correct some books that were already added but that I know have mistakes (like أسرار عائلية).

We're in Tunisia now, so I'm also going to search out new books that have been published in derja over the last couple of years. Faten Fazaa, mashallah, has kept up her pace of a new novel every year, and there are others as well. Whatever I find, I'll have scanned, and then I'll build a similar AI system to transcribe the novels from the scans. We'll see whether that is as accurate as the corrected transcripts; as long as the error rate is below 3%, it should be good enough for most purposes.
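For anyone curious what the correction step looks like in practice, here is a minimal sketch, assuming the OpenAI Python client; the prompt, chunking, and helper names are illustrative rather than the exact pipeline used for the corpus. The WER function is the standard word-level edit distance that the percentages above refer to.

```python
# Sketch of an LLM correction pass plus WER scoring (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CORRECTION_PROMPT = (
    "You are correcting a typed transcript of a Tunisian Arabic (derja) novel. "
    "Fix misread letters and typos, but do not rephrase, translate, or add vowels. "
    "Return only the corrected text."
)

def correct_chunk(chunk: str, model: str = "gpt-4o") -> str:
    """Send one page-sized chunk of transcript to the model for correction."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CORRECTION_PROMPT},
            {"role": "user", "content": chunk},
        ],
        temperature=0,  # we want faithful correction, not creative rewriting
    )
    return response.choices[0].message.content.strip()

def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by the number of reference words.
    Returns a fraction; multiply by 100 for percentages like those above."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution
            )
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)
```

Scoring a book is then a matter of comparing the raw and the corrected transcripts against a hand-checked reference passage with word_error_rate(), which is how figures like the ones quoted above can be produced.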

New Site Launched!

Jan. 9, 2025

As you can see, the new version of the site is finally live. Our apologies that it's taken so long (and that we had several weeks of downtime). What can I say, we're busy people. :-D

We hope that everyone enjoys the more modern look of the site and the better performance. Some of the improvements you might notice are:

  • Search results are returned much faster
  • There are pagination controls on the search results page to break up large result sets
  • Metadata about the texts (category, title, author, date) is displayed on the search results page
  • Each concordance line includes an option to see more context (200 words)
  • Searches can be limited by category, and multiple categories can be selected (a rough sketch of the backend side of this is just below)
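Since the rebuilt site runs on Django (see the rebuild notes further down the page), the pagination and category filtering amount to something like the following sketch. The model and field names (Text, body, category) are hypothetical; only the Django APIs (Paginator, queryset filtering) are real.

```python
# Hypothetical Django view showing paginated, category-filtered search results.
# Model and field names (Text, body, category, title) are illustrative only.
from django.core.paginator import Paginator
from django.shortcuts import render

from .models import Text  # hypothetical corpus text model

def search(request):
    query = request.GET.get("q", "")
    categories = request.GET.getlist("category")  # multiple categories allowed

    results = Text.objects.filter(body__icontains=query)
    if categories:
        results = results.filter(category__in=categories)

    # 50 concordance lines per page; ?page=N selects the page.
    paginator = Paginator(results.order_by("title"), 50)
    page = paginator.get_page(request.GET.get("page"))

    return render(request, "search_results.html", {
        "page": page,
        "query": query,
        "categories": categories,
    })
```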

In addition, there have been many improvements on the backend that make the application faster and more robust. Upcoming features that did not make it into this first release but that we're hoping to add soon include:

  • A progress bar/spinner to show when a search is still being processed. Even though searches are faster, a search for a common word with thousands of results will still take a moment to complete. Right now there's no progress indicator, and nothing to keep the user from hitting the "Search" button multiple times (since it looks like nothing is happening). We want to fix that.
  • Entire-corpus request form. We want to give computational linguists a way to request a full version of the corpus for their research (since we are bad at responding to such requests over email).
  • Search result sorting, giving the user the ability to sort search results by the word before or the word after the keyword.
  • Improvements in the quality of the texts. The materials that were most recently added (the novels and short stories) included many typos and other errors. We plan to hire people to fix them.
  • An improved parser/lemmatizer. The accuracy of the current software that determines the uninflected "lemma" of a word is not very high. Generative AI is getting pretty good at doing that, even for languages like Tunisian Arabic, so I'd like to experiment with building an LLM-based parser.
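As a first experiment, an LLM-based lemmatizer could be as simple as the sketch below, again assuming the OpenAI Python client; the prompt, the JSON fields, and the example word are illustrative only.

```python
# Sketch of an LLM-based lemmatizer for Tunisian Arabic (illustrative only).
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

LEMMATIZER_PROMPT = (
    "You are a morphological analyzer for Tunisian Arabic (derja). "
    "For the given word, return JSON with the fields 'lemma' (the uninflected "
    "dictionary form), 'pos' (part of speech), and 'features' (a short list of "
    "inflectional features such as person, number, and attached clitics)."
)

def lemmatize(word: str, model: str = "gpt-4o") -> dict:
    """Ask the model for the lemma and morphological analysis of one word."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": LEMMATIZER_PROMPT},
            {"role": "user", "content": word},
        ],
        temperature=0,
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(response.choices[0].message.content)

# Example: lemmatize("يكتبولي") might return something like
# {"lemma": "كتب", "pos": "verb",
#  "features": ["3rd person", "plural", "indirect object clitic"]}
```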

What do you think about the new site, and what other features would you like to see? Please let us know!

Site Under Maintenance

Dec. 20, 2024

Apologies to those who have visited the site recently and been greeted with an error message. The site was built on (what is now) a very old version of Python that is no longer supported by our web host. We're currently rebuilding the site with a newer version, and it will be new and improved. Sorry for the inconvenience!

New look and functionality coming soon

Jan. 19, 2024

The software that the current website runs on is now somewhat old, and my web host is going to stop supporting it at the end of February. So I'm working on rebuilding the corpus website with the newest versions of Python, Django, etc. I will be launching it in about a month (~mid-February). I expect there to be minimal downtime when I make the switch to the new site. The look will be different (sleeker, faster, and more modern), but the site capabilities will be mostly the same. I will be implementing a new and improved version of the parser, however, so results of "stem" searches may be slightly different. If you're currently working with a results set, you might want to download it before the change.

Novels are in the process of being added!

Dec. 30, 2021

It's taken longer than we hoped, but we have been slowly adding the Tunisian novels to the corpus. And, thanks to a small grant from the Georgetown University Graduate School, we are getting all the rest of them transcribed during this winter break. Some lovely folks on Amazon Mechanical Turk, as well as some students in Tunisia, are working on the transcriptions now; as soon as they're done I will QC them and get them loaded onto the corpus site. We have eight novels, four children's books, and three translations that will be added to the "Literature" category.