TransWikia.com

Build text complexity model based on complex examples

Data Science Asked by Vitalii Mishchenko on August 27, 2021

I am trying to build a user-specific model that predicts whether an arbitrary English text is complex for a particular user. Having both complex and easy text samples would allow me to build such a model, but what if I have only complex samples? How can I build the model in that case?

I can detect whether a given text differs from those the user marked as difficult (i.e. find the “outlier”). But that information does not tell me in which direction it differs: the text could be easier or more difficult.

Currently I see only one way: make an assumption about what easy text looks like. But that is somewhat unsafe, since different people may each have their own areas that they do not understand in a text.

One Answer

Many ways to measure text complexity have been proposed in the literature. I don't have any particular survey to recommend, but Google is your friend.

Many of these measures are heuristics, i.e. they work in an unsupervised way. I don't remember the details, but I've seen some work that combines several of these measures to obtain more accurate results.

A basic way would be to build a language model on the complex texts, measure how well any new text fits this model, and assume that if it's not similar then it's not complex. But as you rightly noticed, that's not a very safe assumption.
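To make the language-model idea concrete, here is a minimal sketch. The details are my assumptions, not something prescribed above: a character trigram model with add-one smoothing, a hypothetical class name `CharNgramModel`, and perplexity as the "how well does it fit" score. Any other language model and similarity score could be substituted.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram language model with add-one (Laplace) smoothing.

    Hypothetical helper for illustration only.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of the (n-1)-character contexts
        self.vocab = set()

    def _padded(self, text):
        # Start markers give the first characters a full context;
        # the end marker lets the model score where a text stops.
        return "\x02" * (self.n - 1) + text.lower() + "\x03"

    def fit(self, texts):
        for text in texts:
            padded = self._padded(text)
            self.vocab.update(padded)
            for i in range(len(padded) - self.n + 1):
                ngram = padded[i:i + self.n]
                self.ngram_counts[ngram] += 1
                self.context_counts[ngram[:-1]] += 1
        return self

    def perplexity(self, text):
        """Lower values mean the text looks more like the training texts."""
        padded = self._padded(text)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen characters
        log_prob = 0.0
        count = 0
        for i in range(len(padded) - self.n + 1):
            ngram = padded[i:i + self.n]
            prob = (self.ngram_counts[ngram] + 1) / (
                self.context_counts[ngram[:-1]] + vocab_size
            )
            log_prob += math.log(prob)
            count += 1
        return math.exp(-log_prob / count)


# Made-up samples standing in for texts the user marked as complex.
complex_samples = [
    "The epistemological ramifications of quantum decoherence remain contested.",
    "Heteroskedasticity undermines the efficiency of ordinary least squares estimators.",
]
model = CharNgramModel(n=3).fit(complex_samples)

seen = complex_samples[0]
unseen = "zzz qqq xxx jjj"
print(model.perplexity(seen), model.perplexity(unseen))
```

A low perplexity says the new text resembles the user's complex samples. A high perplexity only says it is different, not whether it is easier or harder, which is exactly the weakness of this approach.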

At the most basic level, you can use the type-token ratio (TTR): divide the number of types (unique tokens) by the total number of tokens. The TTR is a fairly good indicator of lexical diversity, so complex text is likely to give a high value. It's a very crude measure, but it's useful as a baseline: whatever system you try, if it doesn't give better results than a threshold on the TTR, then it's not a good system :)
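A minimal sketch of the TTR baseline could look like the following. The regex tokenizer, the function names, and the threshold of 0.7 are all assumptions made for illustration; in practice the threshold would have to be tuned on whatever labelled data is available.

```python
import re


def tokenize(text):
    """Lowercase the text and extract word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def type_token_ratio(text):
    """Number of unique tokens (types) divided by the total number of tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # convention for empty input, to avoid dividing by zero
    return len(set(tokens)) / len(tokens)


def is_complex(text, threshold=0.7):
    """Baseline rule: flag a text as complex when its TTR exceeds the threshold.

    The default threshold is an arbitrary placeholder, not a recommended value.
    """
    return type_token_ratio(text) > threshold


easy = "the cat sat on the mat and the cat sat on the rug"
hard = "epistemological frameworks interrogate tacit presuppositions underlying empirical inquiry"
print(type_token_ratio(easy), type_token_ratio(hard))
print(is_complex(easy), is_complex(hard))
```

Keep in mind that the TTR tends to drop as a text gets longer, because common words repeat, so it is only meaningful when comparing texts of similar length (or fixed-size windows of text).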

Answered by Erwan on August 27, 2021
