Resources and tools

  • AutoSearch
    This demonstrator allows users to define one or more corpora and upload data for them, after which the corpora are made automatically searchable in a private workspace. Users can upload text data annotated with lemma and part-of-speech tags in TEI or FoLiA format, either as a single XML file or as an archive (zip or tar.gz) containing several XML files. Corpus size is limited to begin with (25 MB limit per uploaded file; 500,000-token limit per corpus), but these limits may be increased at a later point in time. The search application is powered by the INL BlackLab corpus search engine. The search interface is the same as the one used in, for example, the Corpus of Contemporary Dutch / Corpus Hedendaags Nederlands.
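    As an illustration, the upload format described above (an archive of TEI/FoLiA XML files, each under the 25 MB per-file limit) can be assembled with a short script. The file names and directory layout here are hypothetical; AutoSearch does not prescribe them:

    ```python
    import zipfile
    from pathlib import Path

    def bundle_corpus(xml_dir: str, archive_path: str, max_file_mb: int = 25) -> list[str]:
        """Zip all XML files in xml_dir, skipping any that exceed the per-file limit.

        Returns the names of files that were skipped for being too large.
        """
        skipped = []
        with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for xml_file in sorted(Path(xml_dir).glob("*.xml")):
                if xml_file.stat().st_size > max_file_mb * 1024 * 1024:
                    skipped.append(xml_file.name)  # too large to upload as-is
                else:
                    zf.write(xml_file, arcname=xml_file.name)
        return skipped
    ```

    The resulting archive can then be uploaded through the AutoSearch web interface; checking the corpus against the 500,000-token limit is done server-side.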
  • CGN
    The Corpus Gesproken Nederlands (Corpus Spoken Dutch) is a collection of 900 hours (almost 9 million words) of contemporary spoken Dutch from native speakers in Flanders and the Netherlands. The speech recordings are aligned with several transcriptions (e.g. orthographic, phonetic) and annotations (syntax, POS-tags). Metadata, lexica, frequency lists and the tool Corex, which can be used to explore the data, are included.
  • Cornetto-LMF (Lexicon Markup Framework)
    Cornetto is a lexical resource for the Dutch language which combines two resources with different semantic structures. It includes the Dutch Wordnet, which organizes words in sets of synonyms (synsets) and records semantic relations between them. It also includes the Dutch Reference Lexicon, which organizes words in form-meaning units (lexical entries) and describes them with short definitions, usage constraints, selection restrictions, syntactic behaviours, combinatorial information and illustrative contexts. Cornetto can be considered a combination of a thesaurus and a dictionary. It is accessible for human use via a web browser and is also available in XML for computational use (opensourcewordnet). Cornetto has approximately 177,000 lexical entries and 70,000 synsets.
  • Corpus of Contemporary Dutch (Corpus Hedendaags Nederlands)
    A collection of more than 800,000 texts taken from newspapers, magazines, news broadcasts and legal writings (1814-2013). The corpus is a combination of the 5, 27 and 38 Million Words Corpora and the PAROLE Corpus, supplemented with newspaper texts from NRC and De Standaard (until 2013).
  • Corpus Gysseling
    The Corpus Gysseling made available here consists of all thirteenth-century texts that have served as source material for the Early Middle Dutch Dictionary. It is the digital edition, enriched with part-of-speech tags and lemmas, of the thirteenth-century material from the Corpus of Middle Dutch texts (until the year 1300), issued in the period from 1977 to 1987 by the Ghent linguist Maurits Gysseling.
  • Corpus VU-DNC (VU University Diachronic News text Corpus)
    The VU-DNC Corpus is a diachronic Dutch newspaper corpus (VU Free University Dutch Newspaper Corpus). The corpus consists of data from five newspapers: Algemeen Dagblad, NRC (Handelsblad), De Telegraaf, Trouw and De Volkskrant. For each of the newspapers, data from two years (1950/1951 and 2002) are available. The articles were selected by topic (e.g. headline news, foreign news and sports). A special feature of the corpus is that both the presence of subjective elements in the articles and the presence of direct speech have been annotated. The subjective elements are annotated based on a set of lexical elements (a subjectivity lexicon). As a result, the corpus is very useful to linguistically oriented researchers interested in diachrony and/or subjectivity, and to communication scientists and media scholars interested in changing practices in the framing of news coverage.
  • Dictionary of the Frisian Language (Woordenboek der Friese Taal)
    The "Wurdboek fan de Fryske taal" is a scientific, descriptive dictionary containing about 120,000 entries. The dictionary articles provide information on the spelling, part of speech, pronunciation, inflection, etymology, meaning (illustrated with quotes), compositions and derivations of each keyword, along with idiomatic information (collocations, proverbs and figurative meanings).
  • DuELME-LMF (Lexicon Markup Framework)
    DuELME is a lexicon of more than 5,000 Dutch multiword expressions. Expressions with the same syntactic pattern are divided into so-called Equivalence Classes, which makes it possible to integrate the lexicon into an NLP system with minimal manual effort. The lexicon has been developed within the framework of the IRME project. (Documentation.)
  • GrETEL
    GrETEL is a query engine in which linguists can use a natural-language example as a starting point for searching a treebank, requiring only limited knowledge of tree representations and formal query languages. By allowing users to search for constructions similar to the example they provide, it aims to bridge the gap between traditional and computational linguistics.
  • Language Portal (Taalportaal)
    Taalportaal will create an online portal containing an exhaustive and fully searchable electronic reference of Dutch and Frisian phonology, morphology and syntax. Its content will be in English. The digital design of the portal enables interoperability between the linguistic categories of phonology, morphology and syntax on the one hand, and between the two languages on the other. The portal’s rich crosslinking will benefit these domains of research, which are now often studied in isolation. The Taalportaal project started in 2011 and runs until the end of 2015.
  • Lassy
    The Lassy Large Corpus is a collection of written texts consisting of approximately 700 million words with automatically generated annotations. The lemmas and POS-tags were generated with Tadpole (now Frog) and the syntactic dependency structures were generated with Alpino.
    The Lassy Small Corpus is a corpus of approximately 1 million words with manually verified syntactic annotations. The lemmas and POS-tags were generated with Tadpole (now Frog) and the syntactic dependency structures were generated with Alpino; these lemmas, POS-tags and syntactic tree structures were subsequently verified and corrected by hand.
  • Namescape
    Recent research has shown that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of, for example, what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale does not yet exist. The project aims to fill this need by annotating a substantial number of literary works with a rich tag set, thereby enabling the participating parties to perform their research in more depth than previously possible. Several exploratory visualization tools will help the scholar answer old questions and uncover many new ones, which can be addressed using the demonstrator.
  • NERD
    NERD (Named Entity Recognition and Disambiguation) is a tool for tagging named entities in Dutch text. It tags persons, locations, organisations, events, products and miscellaneous entities, as described in the accompanying paper. The system was trained on the 1-million-word SoNaR-small corpus, which has publicly available named entity annotations.
  • OpenConvert
    The OpenConvert tools convert to TEI from a number of input formats (ALTO, plain text, Word, HTML). The tools are available as a Java command-line tool, a web service and a web application.
  • OpenSoNaR
    OpenSoNaR is an online system for analyzing and searching SoNaR, the Dutch reference corpus of over 500 million words developed within the STEVIN programme under the aegis of the Dutch Language Union. It is the result of cooperation between IVDNT, TiCC (Tilburg University) and the company De Taalmonsters in the CLARIN-NL Call 4 project OpenSoNaR. The system incorporates the texts and metadata of the SoNaR-500 and SoNaR New Media corpora. The project's main aim was to facilitate the use of the SoNaR corpus by providing a user-friendly online interface, regardless of the user's computer expertise. User groups representing linguistics, media and communication studies, and literary and cultural sciences provided practical use cases on the basis of which the interface was developed. The system is available here for use in research and educational settings.
  • @PhilosTEI
    This system offers an open-source, web-based, user-friendly workflow from digital images of text to XML. This gives philosophers and other digital humanities scholars the opportunity to build their own text corpora and critical text editions. The workflow combines Tesseract for text layout analysis and Optical Character Recognition with a multilingual version of TICCLops (Text-Induced Corpus Clean-up online processing system). TICCLops is also available as a separate web application/service. The current version accepts book pages or other documents in PDF, DjVu or TIFF format and delivers output in FoLiA and TEI XML format. The larger the corpus the system has access to, the better the quality of the correction, a result of the better word statistics derivable from larger corpora. (Documentation: 2014-1; 2014-2.)
  • SoNaR
    The STEVIN SoNaR project has resulted in two datasets, viz. SoNaR-500 and SoNaR-1.
    SoNaR-500 contains over 500 million words (i.e. word tokens) of full texts from a wide variety of text types, including texts from both conventional media and new media. All texts except those from the social media (Twitter, chat, SMS) have been tokenized, tagged for part of speech and lemmatized, and named entities have been labelled. All SoNaR-500 annotations were produced automatically; no manual verification took place.
    SoNaR-1 is a dataset comprising one million words. Although largely a subset of SoNaR-500, SoNaR-1 includes far fewer text types. SoNaR-1 has been provided with different types of semantic annotation, viz. named entity labelling, annotation of co-reference relations, semantic role labelling and annotation of spatial and temporal relations. All annotations have been manually verified.
  • STylene
    Stylene is a robust, modular system for stylometry and readability research, built on existing techniques for automatic text analysis and machine learning. It is offered as a web service that allows researchers in the humanities and social sciences to analyze texts with the system. In this way, the project makes recent advances in the computational modeling of style and readability available to researchers.
  • TICCLops
    TICCLops (Text-Induced Corpus Clean-up online processing system) performs state-of-the-art spelling correction and correction of errors due to Optical Character Recognition for 18 European languages and language varieties. It can also transcribe diachronic text into a more modern version. As a spelling correction system it is unique in making principled use of the input text as the source for canonical word forms. This makes it much less domain-sensitive than other systems: the domain is defined by the input text collection itself. The larger the background corpus the system has access to, the better the quality of the correction. The link gives access to the classic CLAM interface; a more modern and user-friendly version is available through the @PhilosTEI interface. (Documentation: 2014-1; 2014-2.)
  • TTNWW
    TTNWW integrates and makes available existing Language Technology (LT) software components for the Dutch language that were developed in the STEVIN and CGN projects. The LT components (for text and speech) are made available as web services in a simplified workflow system that enables researchers without much technical background to use standard LT workflow recipes.
  • WebCelex
    WebCelex is a web-based interface to the CELEX lexical databases of English, Dutch and German. CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. For each language, the database contains detailed information on: orthography (variations in spelling, hyphenation), phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), morphology (derivational and compositional structure, inflectional paradigms), syntax (word class, word class-specific subcategorizations, argument structures) and word frequency (summed word and lemma counts, based on recent and representative text corpora).
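    The kinds of per-entry information listed above can be pictured as a simple record. The sketch below is purely illustrative (field names, the sample values and the layout are hypothetical, not the actual CELEX database schema):

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class LexicalEntry:
        """Illustrative record mirroring the information categories CELEX covers."""
        orthography: str                 # spelling, incl. hyphenation points
        phonetic_transcription: str      # phonology
        primary_stress: int              # index of the stressed syllable
        word_class: str                  # syntax: part of speech
        inflected_forms: list[str] = field(default_factory=list)  # morphology
        lemma_frequency: int = 0         # summed counts from reference corpora

    # Hypothetical example entry for the Dutch noun "boek" ('book')
    entry = LexicalEntry(
        orthography="boek",
        phonetic_transcription="buk",
        primary_stress=0,
        word_class="N",
        inflected_forms=["boek", "boeken"],
    )
    ```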

Useful links

If you need to convert files from one markup format into another, pandoc is your Swiss Army knife.
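A typical pandoc invocation names the input and output formats with -f and -t and the output file with -o. The small helper below builds that command line and runs it via Python; it assumes pandoc is installed and on your PATH, and the file names are just examples:

```python
import subprocess

def pandoc_cmd(src: str, dst: str, from_fmt: str = "markdown", to_fmt: str = "html") -> list[str]:
    """Build a pandoc command line converting src (from_fmt) to dst (to_fmt)."""
    return ["pandoc", "-f", from_fmt, "-t", to_fmt, src, "-o", dst]

def convert(src: str, dst: str, from_fmt: str = "markdown", to_fmt: str = "html") -> None:
    # Requires pandoc to be installed; raises CalledProcessError on failure.
    subprocess.run(pandoc_cmd(src, dst, from_fmt, to_fmt), check=True)
```

For example, convert("notes.md", "notes.tei", to_fmt="tei") would produce a TEI version of a Markdown file, which matches the TEI-oriented tooling described above.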