
NLP SBert (Bert) for answer comparison STS

Data Science Asked by b_the_builder on January 7, 2021

I’ve been researching a good way to automate short answer evaluation. Essentially a teacher gives a test with some questions like:

Question: Why did Columbus sail westward to find Asia?

Answer: so he could find a new trade route to Asia through the ocean. Three goals of the Spanish in the Americas were the desire to attain great amounts of riches, to establish claims on as much land as possible, and to colonize as much as possible.

With that, we have the correct answer and would like to compare it with the student's answer and produce a score based on similarity. I know this isn't a reliable replacement for human grading, but let's assume it is for the sake of the example.

I’ve come across this paper and codebase:
https://arxiv.org/pdf/1908.10084.pdf

https://github.com/UKPLab/sentence-transformers

It seems like the ideal method for solving this problem, but most examples are based on scoring/ranking for semantic search. I question whether I'm on the right path, given that I'm just comparing two answers and not a cluster. Can anyone with more experience provide some guidance?

3 Answers

I tried GPT-2 with your prompt but I was not extremely successful:

[Screenshot of GPT-2's output omitted.]

Answered by Valentas on January 7, 2021

I have used Siamese BERT and I can say it does a pretty good job. However, the issue is that the data SBERT was fine-tuned on, on top of BERT, may not represent the same notion of semantic distance that you need between the true answer and the student's one. For instance, in a question about engineering, a small change of wording may mean a totally different thing; SBERT would still find the two answers quite similar because they relate to the same topic, unless it's fine-tuned.

Moreover, you will not be able to interpret the similarity score. Should a student ask why a peer's answer scored higher, you won't be able to explain it.

My opinion: I believe you could use this tool as a way to filter out totally incoherent answers, but at some point human evaluation will be needed. You could also use interpretable metrics such as ROUGE or BLEU. I am aware as well that this topic is quite trendy in NLP, so I wouldn't be surprised if there is, or soon will be, a good off-the-shelf tool for this, but I am not aware of one currently.
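To illustrate what I mean by an interpretable metric: a stripped-down, ROUGE-1-recall-style overlap score takes only a few lines (a toy sketch I wrote for illustration, not the official ROUGE implementation, which adds n-grams, stemming, and more):

```python
# Toy ROUGE-1-recall-style score: what fraction of the reference answer's
# words appear in the student's answer? Unlike a neural similarity, every
# point lost can be traced to a specific missing word.
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(n, cand_counts[w]) for w, n in ref_counts.items())
    return overlap / sum(ref_counts.values())

ref = "a new trade route to asia through the ocean"
ans = "he found a new ocean route to asia"
print(round(rouge1_recall(ref, ans), 2))  # → 0.67
```

Here you can tell the student exactly which reference words ("trade", "through", "the") were missing, which is the kind of explanation an embedding similarity cannot give.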

Answered by Grzegorz on January 7, 2021

@b_the_builder Nice finds! The first seems to me like an advancement of Word Mover's Distance, using the similarities between individual words. I believe it still may lack domain adaptation, whereas the second link you provided does the pre-training for that specific reason. All in all, whatever method you use, I believe you will need to pick some representative hard-match sentence pairs and see how it performs on them after pre-training on your corpora, if you want to be sure. For inspiration, you can look here at semantic similarity tasks between sentences.

Answered by Grzegorz on January 7, 2021

