Open datasets



This package contains the questions and answers created by citizens using the Citizenpedia during the Trento evaluation of the results of the H2020 project SIMPATICO, which was conducted between September 2017 and January 2019.


SIMPATICO Logs Final Trento Dataset v1.0

SIMPATICO logs for the user evaluation of Trento in project iterations 1 and 2. This package contains the interaction log data captured in the two Trento evaluations of the results of the H2020 project SIMPATICO, which were undertaken from September 2017 to January 2019.


BenchLS: A Reliable Dataset for Lexical Simplification

To create our dataset we combined two resources: the LexMTurk (Horn et al., 2014) and LSeval (De Belder and Moens, 2012) datasets. The instances in both datasets, 929 in total, contain a sentence, a target complex word, and several candidate substitutions ranked according to their simplicity.


NNSeval: Evaluating Lexical Simplification for Non-Natives

We conducted a user study to learn more about word complexity for non-native speakers. A total of 400 non-native speakers, all university students or staff, participated in the experiment. They were asked to judge whether or not they could understand the meaning of each content word.


BenchPS: A Benchmark Dataset for Phrase Simplification

BenchPS is a dataset built for the training and evaluation of phrase simplification systems. Each instance is composed of a sentence, a target complex phrase, and a set of candidate simplifications ranked by simplicity.


Common20LS: A Lexical Simplification Dataset with Demographic Information

Common20LS is a dataset for the task of Lexical Simplification that contains demographic information about the annotators. It consists of 20 Lexical Simplification problems annotated by 262 people.


SimPA: A Sentence-Level Simplification Corpus for the Public Administration Domain

We present a sentence-level simplification corpus with content from the Public Administration (PA) domain. The corpus contains 1,100 original sentences with manual simplifications collected through a two-stage process.


Simple Italian sentences ranked by readability

The dataset contains 500,000 sentences extracted from the Paisà corpus, selected as easy to read according to four parameters: token number, average word length, depth of the parse tree, and verb “arity”.


Italian Lexical Simplification Benchmark

The corpus is a manually created benchmark for evaluating the performance of Italian lexical simplification systems. It contains 901 pairs of complex sentences and their simplified versions at the lexical level (i.e. replacement of a difficult term or phrase with a simpler synonym).



This package contains the questions and answers created by citizens using the Citizenpedia during the Galicia evaluation of the results of the H2020 project SIMPATICO, which was conducted in October 2017.



This package contains the questions and answers created by citizens using the Citizenpedia during the Galicia evaluation of the results of the H2020 project SIMPATICO, which was conducted between September 24th 2018 and October 15th 2018.


SIMPITIKI corpus for simplification in Italian

SIMPITIKI is a Simplification corpus for Italian and it consists of two sets of simplified pairs: the first one is harvested from the Italian Wikipedia in a semi-automatic way; the second one is manually annotated sentence-by-sentence from documents in the administrative domain.


SIMPATICO Second Evaluation Galicia Dataset v1.0

SIMPATICO logs for the user evaluation of Galicia in project iteration 2. This package contains the interaction log data captured in the Galicia evaluation of the results of the H2020 project SIMPATICO, which was conducted between September 24th 2018 and October 15th 2018.


SIMPATICO First Evaluation Galicia Dataset v1.1

SIMPATICO logs for the user evaluation of Galicia in project iteration 1. This package contains the interaction log data captured in the Galicia evaluation of the results of the H2020 project SIMPATICO, which was conducted between October 23rd 2017 and November 3rd 2017.