Resources and tools

  • AutoSearch
    This demonstrator allows users to define one or more corpora and upload data for them, after which the corpora are automatically made searchable in a private workspace. Users can upload text data annotated with lemma and part-of-speech tags in TEI or FoLiA format, either as a single XML file or as an archive (zip or tar.gz) containing several XML files. Corpus size is initially limited (25 MB per uploaded file; 500,000 tokens for an entire corpus), but these limits may be increased at a later point in time. The search application is powered by the INL BlackLab corpus search engine. The search interface is the same as the one used in, for example, the Corpus of Contemporary Dutch / Corpus Hedendaags Nederlands.
  • Cornetto-LMF (Lexicon Markup Framework)
    Cornetto is a lexical resource for the Dutch language which combines two resources with different semantic structures. It includes the Dutch Wordnet which organizes words in sets of synonyms (synsets) and records semantic relations between them. It also includes the Dutch Reference Lexicon which organizes words in form-meaning units (lexical entries) and describes them with short definitions, usage constraints, selection restrictions, syntactic behaviours, combinatorial information and illustrative contexts. Cornetto can be considered as the combination of a thesaurus and a dictionary. It is accessible for human use via a web browser and it is also available in XML for computational use (opensourcewordnet). Cornetto has circa 177,000 lexical entries and 70,000 synsets.
  • Corpus of Contemporary Dutch (Corpus Hedendaags Nederlands)
    A collection of more than 800,000 texts taken from newspapers, magazines, news broadcasts and legal writings (1814-2013). The corpus is a combination of the 5, 27 and 38 Million Words Corpora and the PAROLE Corpus, supplemented with newspaper texts from NRC and De Standaard (until 2013).
  • Corpus Gysseling
    The Corpus Gysseling made available here consists of the collection of all thirteenth-century texts that have served as source material for the Early Middle Dutch Dictionary. It is the digital edition, enriched with part of speech and lemma, of the thirteenth-century material from the Corpus of Middle Dutch texts (until the year 1300), issued in the period from 1977 to 1987 by the Ghent linguist Maurits Gysseling.
  • Corpus VU-DNC (VU University Diachronic News text Corpus)
    The VU-DNC Corpus is a diachronic Dutch newspaper corpus (VU Free University Dutch Newspaper Corpus). The corpus consists of data from five newspapers: Algemeen Dagblad, NRC (Handelsblad), De Telegraaf, Trouw and De Volkskrant. For each of the newspapers, data from 1950/1951 and from 2002 are available. The articles were selected by topic (e.g. headline news, foreign news and sports). A special feature of the corpus is that both the presence of subjective elements in the articles and the presence of direct speech have been annotated. The subjective elements are annotated based on a set of lexical elements (subjectivity lexicon). As a result, the corpus is very useful to linguistically oriented researchers who are interested in diachrony and/or subjectivity, and to communication scientists and media scholars who are interested in changing practices regarding the framing of coverage.
  • Dictionary of the Frisian Language (Woordenboek der Friese Taal)
    The "Wurdboek fan de Fryske taal" is a scientific, descriptive dictionary containing about 120,000 entries. The dictionary articles provide information on the spelling, part of speech, pronunciation, inflection, etymology, meaning (illustrated with quotes), compositions and derivations of each keyword, along with idiomatic information (collocations, proverbs and figurative meanings).
  • DuELME-LMF (Lexicon Markup Framework)
    DuELME is a lexicon of more than 5,000 Dutch multi-word expressions. Expressions with the same syntactic pattern are grouped into so-called Equivalence Classes, which makes it possible to integrate the lexicon into an NLP system with minimal manual effort. The lexicon has been developed within the framework of the IRME project. (Documentation.)
  • Language Portal (Taalportaal)
    Taalportaal will create an online portal containing an exhaustive and fully searchable electronic reference of Dutch and Frisian phonology, morphology and syntax. Its content will be in English. The digital design of the portal enables interoperability between the linguistic categories of phonology, morphology and syntax on the one hand, and between the two languages on the other. The portal’s rich crosslinking will benefit these domains of research, which are now often studied in isolation. The Taalportaal project started in 2011 and runs until the end of 2015.
  • Namescape
    Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of, for example, what is characteristic for a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet. The project aims to fill the need by annotating a substantial amount of literary works with a rich tag set, thereby enabling the participating parties to perform their research in more depth than previously possible. Several exploratory visualization tools will help the scholar to answer old questions and uncover many more new ones, which can be addressed using the demonstrator.
  • NERD
    NERD (Named Entity Recognition and Disambiguation) is a tool for tagging named entities in Dutch text. It tags persons, locations, organisations, events, products and miscellaneous entities, as described in the accompanying paper. The system was trained on the 1-million-word SoNaR-small corpus, which has publicly available named entity annotations.
  • OpenConvert
    The OpenConvert tools convert a number of input formats (ALTO, text, Word, HTML) to TEI. The tools are available as a Java command line tool, a web service and a web application.
  • OpenSoNaR
    OpenSoNaR is an online system for searching and analyzing SoNaR, the Dutch reference corpus of over 500 million words developed within the STEVIN programme under the aegis of the Dutch Language Union. It is the result of cooperation between IVDNT, TiCC - Tilburg University and the company De Taalmonsters, in the CLARIN-NL Call 4 project OpenSoNaR. The system incorporates the texts and metadata of the SoNaR-500 and SoNaR New Media corpora. The project's main aim was to facilitate the use of the SoNaR corpus by providing a user-friendly online interface, regardless of the user's personal computer expertise. User groups representing Linguistics, Media and Communication Studies, as well as Literary and Cultural Sciences have provided practical use cases on the basis of which the interface has been developed. The system is available here for use in research and educational settings.
  • @PhilosTEI
    This system offers an open source, web-based, user-friendly workflow from digital images of text to XML. It gives philosophers and other digital humanities scholars the opportunity to build their own text corpora and critical text editions. The workflow combines Tesseract, for text layout analysis and Optical Character Recognition, with a multilingual version of TICCLops (Text-Induced Corpus Clean-up online processing system). TICCLops is also available as a separate web application/service. The current version accepts book pages or other documents in PDF, DjVu or TIFF format and delivers output in FoLiA and TEI XML format. The larger the corpus the system has access to, the better the quality of the correction, because larger corpora yield better word statistics. (Documentation: 2014-1; 2014-2.)
  • TICCLops
    TICCLops (Text-Induced Corpus Clean-up online processing system) performs state-of-the-art spelling correction and correction of errors due to Optical Character Recognition for 18 European languages and language varieties. It also transcribes diachronic text into a more modern version. As a spelling correction system it is unique in making principled use of the input text as a source for finding canonical word forms. This makes it much less domain sensitive than other systems, because the domain is covered by the input text collection. The larger the background corpus the system has access to, the better the quality of the correction. The link gives access to the classic CLAM interface. A more modern and user-friendly version is available through the @PhilosTEI interface. (Documentation: 2014-1; 2014-2.)
  • WebCelex
    WebCelex is a web-based interface to the CELEX lexical databases of English, Dutch and German. CELEX was developed as a joint enterprise of the University of Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck Institute for Psycholinguistics in Nijmegen, and the Institute for Perception Research in Eindhoven. For each language, the database contains detailed information on orthography (variations in spelling, hyphenation), phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress), morphology (derivational and compositional structure, inflectional paradigms), syntax (word class, word class-specific subcategorizations, argument structures) and word frequency (summed word and lemma counts, based on recent and representative text corpora).
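
The AutoSearch entry above states two upload limits: 25 MB per uploaded file and 500,000 tokens per corpus. The following is a minimal sketch of how one might check a set of XML files against those limits before uploading. It is not part of AutoSearch. The function names are hypothetical, reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption, and so is counting `<w>` and `<pc>` elements as tokens; the service may count differently.

```python
import io
import xml.etree.ElementTree as ET

# Limits as stated in the AutoSearch description.
# Assumption: "25 MB" is read here as 25 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 25 * 1024 * 1024
MAX_CORPUS_TOKENS = 500_000

# Assumption: word (<w>) and punctuation (<pc>) elements count as tokens.
TOKEN_TAGS = {"w", "pc"}


def count_tokens(xml_bytes):
    """Count token elements in one XML document, ignoring namespaces."""
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        # Tags look like "{namespace}w"; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name in TOKEN_TAGS:
            count += 1
        elem.clear()  # release the element's content to keep memory use low
    return count


def check_corpus(files):
    """Return a list of problems for a mapping of file name -> file bytes."""
    problems = []
    total_tokens = 0
    for name, data in files.items():
        if len(data) > MAX_FILE_BYTES:
            problems.append(f"{name}: {len(data)} bytes exceeds the per-file limit")
        total_tokens += count_tokens(data)
    if total_tokens > MAX_CORPUS_TOKENS:
        problems.append(f"corpus: {total_tokens} tokens exceeds the corpus limit")
    return problems
```

Calling `check_corpus` with the contents of each file to be uploaded returns an empty list when both limits are respected under these assumptions.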

Useful links

If you need to convert files from one markup format into another, pandoc is your Swiss Army knife.