TransWikia.com

How to create a Document Categorization Classifier for different contexts of Documents

Data Science Asked on July 9, 2021

I have a question about a test I'm working on. The idea is to demonstrate NLP and machine translation skills.

The dataset is a multilingual, multi-context set of documents. It is divided by context category (Wikipedia, conference_papers, Amazon Reviews, etc.) and by language.

The objective is to create a document categorization classifier (in Python) for the different contexts of the documents. Classification has to be done at the context level, regardless of the language the documents are written in.

An important detail is that the original dataset has been modified so that no document is repeated in two languages.

I have two ideas in mind for solving this:

  1. Train on all the data to create a single multilingual classifier.
  2. Do language detection first, then use monolingual models.

What could be a reasonable approach to doing text classification for multiple languages?

One Answer

This depends on whether you have enough training data for all languages. If you do, running language ID and then language-specific models might be a good choice, especially if there is a BERT-like model available for each language.
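
As a rough sketch of the language-ID route: detect the language, then dispatch to a classifier trained only on that language. Everything named here (`detect_language`, `LanguageRouter`, the stopword lists, `keyword_classifier`, the label strings) is a hypothetical stand-in chosen to keep the example dependency-free; in practice you would swap in a real language identifier and a fine-tuned model per language.

```python
from typing import Callable, Dict

# Toy stopword lists: a stand-in for a real language identifier (assumption).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "this"},
    "de": {"der", "und", "ist", "das", "nicht"},
    "es": {"el", "los", "es", "y", "muy"},
}


def detect_language(text: str) -> str:
    """Return the language whose stopwords overlap most with the text."""
    tokens = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))


class LanguageRouter:
    """Send each document to the classifier trained for its language."""

    def __init__(self, detect: Callable[[str], str],
                 classifiers: Dict[str, Callable[[str], str]]):
        self.detect = detect
        self.classifiers = classifiers

    def predict(self, text: str) -> str:
        lang = self.detect(text)
        if lang not in self.classifiers:
            raise KeyError(f"no classifier trained for language {lang!r}")
        return self.classifiers[lang](text)


def keyword_classifier(review_words: set) -> Callable[[str], str]:
    """Hypothetical placeholder for a fine-tuned monolingual model."""
    def classify(text: str) -> str:
        tokens = set(text.lower().split())
        return "amazon_reviews" if tokens & review_words else "wikipedia"
    return classify


router = LanguageRouter(detect_language, {
    "en": keyword_classifier({"bought", "refund", "stars"}),
    "de": keyword_classifier({"gekauft", "lieferung", "sterne"}),
})
```

The router fails loudly for a language without a trained model, which is the main practical weakness of this route: every language in the data needs enough labelled documents of its own.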

An alternative is to do language ID, then machine-translate the input into English and train only an English classifier. You can use, e.g., the high-quality pre-trained Marian models recently published by the University of Helsinki.
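
The translate-then-classify route could look roughly like this. `make_marian_translator` assumes the Hugging Face `transformers` package and the `Helsinki-NLP/opus-mt-de-en` checkpoint as one example of a Helsinki Marian model; the rest (`TranslateThenClassify`, the toy glossary, detector, and classifier) is hypothetical glue so the shape of the pipeline is visible without downloading a model.

```python
from typing import Callable, Dict


def make_marian_translator(
    model_name: str = "Helsinki-NLP/opus-mt-de-en",
) -> Callable[[str], str]:
    """Wrap a Marian checkpoint as a str -> str function.

    Assumes `transformers` is installed; it is imported lazily so the rest
    of this sketch runs without it. Long documents would need chunking,
    which is not handled here.
    """
    from transformers import pipeline
    translator = pipeline("translation", model=model_name)
    return lambda text: translator(text)[0]["translation_text"]


class TranslateThenClassify:
    """Translate non-English input into English, then apply one English classifier."""

    def __init__(self, detect: Callable[[str], str],
                 translators: Dict[str, Callable[[str], str]],
                 english_classifier: Callable[[str], str]):
        self.detect = detect
        self.translators = translators  # source language code -> translator
        self.english_classifier = english_classifier

    def predict(self, text: str) -> str:
        lang = self.detect(text)
        if lang != "en":
            text = self.translators[lang](text)
        return self.english_classifier(text)


# Stand-ins so the sketch runs offline (all hypothetical).
GLOSSARY = {"ich": "i", "habe": "have", "gekauft": "bought", "das": "this"}


def toy_detect(text: str) -> str:
    return "de" if "ich" in text.lower().split() else "en"


def toy_translate(text: str) -> str:
    return " ".join(GLOSSARY.get(tok, tok) for tok in text.lower().split())


def toy_classifier(text: str) -> str:
    return "amazon_reviews" if "bought" in text.lower().split() else "wikipedia"


model = TranslateThenClassify(toy_detect, {"de": toy_translate}, toy_classifier)
```

With real components you would pass something like `{"de": make_marian_translator()}` as `translators`, one entry per source language.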

Otherwise, I would use pre-trained multilingual representations (probably XLM-R, which is much better than Multilingual BERT) and train a single classifier on top of them for all languages. The multilingual representations even seem to have some zero-shot abilities, i.e., the classifiers seem to generalize to languages that are not in the training data.
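
A minimal sketch of the single-classifier route, under some assumptions: `make_xlmr_embedder` relies on `torch`, `transformers`, and the `xlm-roberta-base` checkpoint, and mean pooling is just one simple way to get a document vector. The nearest-centroid classifier is a deliberately simple stand-in (logistic regression or fine-tuning would be more usual), and `toy_embed` is a hypothetical placeholder so the example runs without a model; it says nothing about how XLM-R itself behaves.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def make_xlmr_embedder(model_name: str = "xlm-roberta-base") -> Callable[[str], List[float]]:
    """Mean-pooled XLM-R document vectors (lazy imports; needs torch + transformers)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    def embed(text: str) -> List[float]:
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state  # shape: (1, tokens, dim)
        return hidden.mean(dim=1).squeeze(0).tolist()

    return embed


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CentroidClassifier:
    """One classifier for all languages: nearest class centroid by cosine similarity."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self.embed = embed
        self.centroids: Dict[str, List[float]] = {}

    def fit(self, texts: Sequence[str], labels: Sequence[str]) -> "CentroidClassifier":
        grouped = defaultdict(list)
        for text, label in zip(texts, labels):
            grouped[label].append(self.embed(text))
        for label, vectors in grouped.items():
            # Average each dimension over all vectors of this class.
            self.centroids[label] = [sum(col) / len(vectors) for col in zip(*vectors)]
        return self

    def predict(self, text: str) -> str:
        vector = self.embed(text)
        return max(self.centroids, key=lambda lab: _cosine(vector, self.centroids[lab]))


def toy_embed(text: str) -> List[float]:
    """Hypothetical language-independent features: '!' count and digit count."""
    return [float(text.count("!")), float(sum(ch.isdigit() for ch in text))]


clf = CentroidClassifier(toy_embed).fit(
    ["Great product!!", "Tolles Produkt!", "Founded in 1850", "Gegründet 1901"],
    ["amazon_reviews", "amazon_reviews", "wikipedia", "wikipedia"],
)
```

Replacing `toy_embed` with `make_xlmr_embedder()` keeps the rest unchanged: training documents from every language go into the same `fit` call, and no language ID step is needed at prediction time.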

Correct answer by Jindřich on July 9, 2021
