New Site Launched!
Jan. 9, 2025
As you can see, the new version of the site is finally live. Our apologies that it's taken so long (and that we had several weeks of downtime). What can I say, we're busy people. :-D
We hope that everyone enjoys the more modern look of the site and the better performance. Some of the improvements you might notice are:
- Search results are returned a lot faster
- There are pagination controls on the search results page to break up numerous results
- Metadata about the texts (category, title, author, date) is displayed in the search results page
- The concordance line includes an option to see more context (200 words)
- Searches can be limited by category and multiple categories can be selected.
In addition, there have been many improvements on the backend that make the application faster and more robust. Upcoming features that did not make it into this first release but that we're hoping to add soon include:
- A progress bar/spinner to show when a search is still being processed. Even though the searches are faster, if you search for a common word that has thousands of results, it will take a moment to complete. Right now there's no indicator of progress, or anything to keep the user from hitting the "Search" button multiple times (since it looks like it's not doing anything). We want to fix that.
- Entire corpus request form. We want to give computational linguists a way to request a full version of the corpus for their research. (Since we are bad at responding to such requests over email.)
- Search results sorting. Giving the user the ability to sort search results by word before and word after.
- Improvements in the quality of the texts. The materials that were most recently added (the novels and short stories) included many typos and other errors. We plan to hire people to fix them.
- An improved parser/lemmatizer. The accuracy of the current software that determines the uninflected "lemma" of a word is not very high. Generative AI is getting pretty good at doing that, even for languages like Tunisian Arabic, so I'd like to experiment with building an LLM-based parser.
What do you think about the new site, and what other features would you like to see? Please let us know!
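For the curious, the planned sorting by word before and word after could look something like the sketch below. This is only an illustration: the `ConcordanceLine` class and the helper names are hypothetical and not taken from the site's code, and plain `sorted` orders strings by Unicode code point, which only approximates alphabetical order for Arabic script.

```python
from dataclasses import dataclass


@dataclass
class ConcordanceLine:
    """One search hit (hypothetical structure, for illustration only)."""
    before: str  # context preceding the search term
    term: str    # the matched search term
    after: str   # context following the search term


def word_before(line: ConcordanceLine) -> str:
    """Return the word immediately preceding the search term, or ''."""
    tokens = line.before.split()
    return tokens[-1] if tokens else ""


def word_after(line: ConcordanceLine) -> str:
    """Return the word immediately following the search term, or ''."""
    tokens = line.after.split()
    return tokens[0] if tokens else ""


def sort_results(lines, by="before"):
    """Return a new list sorted by the word before or the word after the term."""
    key = word_before if by == "before" else word_after
    return sorted(lines, key=key)
```

Calling `sort_results(hits, by="after")` would group hits that share the same following word, which is the usual reason to sort a concordance this way.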
New look and functionality coming soon
Jan. 19, 2024
The software that the current website runs on is now somewhat old, and my web server is going to stop supporting it at the end of February. So I'm working on rebuilding the corpus website with the newest versions of Python, Django, etc. I will be launching it in about a month (~mid-February).
I expect there to be minimal downtime when I make the switch to the new site. The look will be different (sleeker, faster, and more modern), but the site capabilities will be mostly the same. I will be implementing a new and improved version of the parser, however, so results of "stem" searches may be slightly different. If you're currently working with a results set, you might want to download it before the change.
Novels are in the process of being added!
Dec. 30, 2021
It's taken longer than we hoped, but we have been slowly adding the Tunisian novels to the corpus. And, thanks to a small grant from the Georgetown University Graduate School, we are getting all the rest of them transcribed during this winter break. Some lovely folks on Amazon Mechanical Turk, as well as some students in Tunisia, are working on the transcriptions now; as soon as they're done I will QC them and get them loaded onto the corpus site. We have eight novels, four children's books, and three translations that will be added to the "Literature" category.
New Material Soon to Be Added
Dec. 16, 2019
Over the summer, we collected twelve full-length books written in Tunisian Arabic: eight novels, three translations and some short story collections. We are currently seeking funding to digitize them, and will post another update once that is done and they are added to the corpus.
Quarter million new words added to corpus
April 14, 2018
<p>Thanks to some new large texts and technological improvements that enabled the parsing of previously unanalyzed texts, we have now added almost 250,000 parsed words to the corpus.</p>
<p>Special thanks to Emna Souissi, Assistant Professor of Computer Science at ENSIT (University of Tunis) for her contribution of a 25,000 corpus of SMS and Facebook communications. Anyone who uses this data should, in addition to <a href="/cite/">citing the Tunisian corpus</a>, cite her corpus as well:</p>
<ul><li>Jihene Younes, Hadhemi Achour and Emna Souissi, "Constructing linguistic resources for the Tunisian dialect using textual user-generated contents on the social web" In <i>Proceedings of the 1st International Workshop on Natural Language Processing for Informal Text (NLPIT 2015) In conjunction with The International Conference on Web Engineering (ICWE 2015)</i>, Rotterdam, The Netherlands, 2015, pp. 3-14.</li></ul>
Corpus is Back Up!
March 31, 2018
I've finished the move to the new webhost (as well as upgrading the code from Python 2 to Python 3). I'm still ironing out a few bugs, but everything should be mostly working. This new setup will be much easier to maintain and edit, so expect some improvements to the site coming soon!
Corpus Temporarily Down
March 23, 2018
We're in the process of migrating the corpus to another webhost, and the database is going to be down during this transition. Our apologies, and we will have the data back up as soon as possible. Email karenlmcneil@gmail.com if you have an urgent request.
Download Function Improved
April 29, 2016
The download function has been improved, so that the elements (before context, search term, and after context) appear in the correct order.
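As a rough illustration of how a fixed column order can be guaranteed in an export, here is a minimal sketch using Python's standard `csv` module. The column names and the `results_to_csv` function are assumptions made for this example, not the site's actual code.

```python
import csv
import io

# Assumed column names; the order of this list fixes the order in the file.
COLUMNS = ["before_context", "search_term", "after_context"]


def results_to_csv(rows):
    """Serialize result dicts to CSV text with columns in COLUMNS order."""
    buffer = io.StringIO()
    # DictWriter writes fields in `fieldnames` order, regardless of the
    # key order inside each row dict; unknown keys are ignored.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

Because the writer is driven by `COLUMNS` rather than by each row's own key order, a row built as `{"after_context": ..., "search_term": ..., "before_context": ...}` still comes out as before context, search term, after context.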
Corpus To Be Presented at University of Vienna: July 6, 2015
June 23, 2015
Karen will be giving the keynote address at the <a href="http://orientalistik.univie.ac.at/aktuelles/einzelansicht/article/international-symposium-on-tunisian-and-libyan-arabic-dialects-6th-to-8th-july-2015/">International Symposium on Tunisian and Libyan Arabic Dialects</a>, at the University of Vienna on July 6, 2015. Her presentation is entitled "Tunisian Arabic Corpus: Creating a Written Corpus of an 'Unwritten' Language." She will also be presenting separately about her research on the use of <i>fī</i> ("in") as a marker of the progressive verbal aspect in Tunisian (and Libyan) Arabic. This work was informed by data from the corpus.
Problem with Search Function
March 23, 2015
Earlier today the corpus was returning erroneous empty results. It's working now, but if anyone experiences problems like this, please <a href="mailto:karenlmcneil@gmail.com">contact us</a>. Thank you!
Corpus Presented at Brown University Digital Humanities Workshop
Oct. 18, 2014
Karen had an opportunity to present a poster about the corpus at Brown's Digital Islamic Humanities Workshop. Here's the handout, which provides a brief overview of the project and its current status: TACHandout.pdf.
Search Tool Improved
May 30, 2014
<p>There were several improvements made to the search tool:</p>
<ul>
<li>A "category" field was added, so you can filter results by text category.</li>
<li>Bug Fix: Added validation to the form, so that it will not allow users to submit empty queries (which used to lead to errors).</li>
<li>Bug Fix: Added validation to check that any regular expression entered is valid.</li>
</ul>
<p>Right now the new search tool is only here on the index page. There were some difficulties adding it to the corpus results page, but we'll try to straighten them out in the next update.</p>
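<p>The two validation fixes above can be sketched roughly as follows. This is a simplified illustration, not the site's actual form code: the <code>validate_query</code> function, its <code>is_regex</code> flag, and the error messages are all hypothetical.</p>

```python
import re


def validate_query(query: str, is_regex: bool = False) -> list[str]:
    """Return a list of error messages; an empty list means the query is OK."""
    errors = []
    # Reject empty or whitespace-only queries before doing anything else.
    if not query or not query.strip():
        errors.append("Please enter a search term.")
        return errors
    # If the user asked for a regex search, make sure the pattern compiles.
    if is_regex:
        try:
            re.compile(query)
        except re.error as exc:
            errors.append(f"Invalid regular expression: {exc}")
    return errors
```

<p>Checking the pattern with <code>re.compile</code> before running the search means a malformed expression such as an unclosed parenthesis produces a message for the user rather than a server error.</p>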
Google Chrome Problem Fixed
Oct. 10, 2012
It was brought to our attention that the concordance results were not displaying correctly in Google Chrome. The issue has now been fixed.
Stability Improved
Aug. 31, 2012
We've added a test server, to validate any changes before they go live. So if you've visited the site and been greeted with an unpleasant error message, this should ensure that that doesn't happen anymore.
Large Amount of Web Data Added
Aug. 30, 2012
A large number of internet texts have been added to the corpus, using WebBootCaT (through Sketch Engine). These texts will need to be de-duped, and may contain non-Tunisian material, but at a first pass they seem to be largely Tunisian. They come from blogs, forum postings, YouTube comments, and other informal sources. There's also some erotic fiction (expanding the breadth of vocabulary represented into previously uncovered territory), and there may be other fiction as well. This would be a great addition to the corpus, since there is no prose fiction currently represented, with the exception of folktales. In addition to being a welcome addition in and of themselves, these new texts will also provide the sites (especially blogs) where more Tunisian texts can be gathered.
Results Now Downloadable
July 24, 2012
A link has been added to the concordance page which allows the search results to be downloaded as a .csv file. The csv file can then be opened up in Microsoft Excel or any text editor for further analysis.
Search Capability Upgraded
July 23, 2012
The search tool has been upgraded with a morphological parser, allowing users to search for words by the stem and get results for all inflected forms. The parser currently has an accuracy of 88% (recall: 0.868, precision: 0.970, F-score: 0.916). Future versions of the parser will attempt to improve this accuracy.
The parser is rule-based, with some additional statistical processing to improve results. For more details on the internal workings of the parser and how it was developed, an informal paper on the topic is available.
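For readers who want to check the numbers: assuming the F-score quoted above is the standard balanced F1 (the harmonic mean of precision and recall), it can be reproduced from the reported precision and recall as in the short sketch below. The `f_score` function is written for this example only.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 is the harmonic mean of P and R."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid dividing by zero when both are zero
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall as reported for the parser.
print(round(f_score(precision=0.970, recall=0.868), 3))  # 0.916
```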
Tunisian Arabic Corpus presented at Arabic Corpus Linguistics workshop
April 12, 2011
Karen and Miled presented a paper on the corpus project at the Arabic Corpus Linguistics workshop at Lancaster University in England.
Tunisian Arabic Corpus presented at Jil Jadid conference
Feb. 19, 2011
<p>Karen gave a presentation about the Tunisian Arabic Corpus project at the Jil Jadid conference at the University of Texas at Austin. A video of the presentation is available here:</p>
<br />
<iframe width="560" height="315" src="https://www.youtube.com/embed/ctY0VP_WWNk?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>