Text vectorizer that captures feature offsets in the text?

Data Science Asked by R Sorek on October 24, 2020

I’m using sklearn's TfidfVectorizer to extract features from text for text classification.
I believe the information I need tends to be at the beginning of the document, so I would like to somehow capture the offset of each feature per document (either the offset of its first appearance, or the mean offset over all appearances).
Is there a vectorizer that can do that, or some other method of extracting this information efficiently?

Thank you!

One Answer

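One approach is to create a second matrix that stores this information. Scikit-learn stores text features in a document-by-token matrix. The offset matrix would have the same shape, but each cell would hold the index of the token within the document (for example, the position of its first appearance) instead of a count or weight. This matrix could then be used as additional features during modeling.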

This would require writing a custom vectorizer, similar to scikit-learn's CountVectorizer implementation.
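A minimal sketch of the idea, using only the standard library so it runs on its own. The function names (tokenize, build_vocabulary, offset_features) and the mode parameter are hypothetical, not part of scikit-learn. The tokenizer is an assumption that mimics CountVectorizer's default behaviour (lowercasing, words of two or more characters), and positions are 1-based so that 0 can mean "token absent".

```python
import re

# Assumed to match scikit-learn's default token_pattern: words of 2+ characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase a document and split it into word tokens."""
    return TOKEN_RE.findall(doc.lower())


def build_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}


def offset_features(docs, vocabulary, mode="first"):
    """Return a dense document-by-token matrix as a list of lists.

    Each cell holds the 1-based token position of the feature in that
    document (first occurrence, or mean over all occurrences), or 0.0
    if the token does not occur in the document.
    """
    if mode not in ("first", "mean"):
        raise ValueError("mode must be 'first' or 'mean'")
    matrix = []
    for doc in docs:
        positions = {}  # column index -> list of 1-based positions
        for pos, tok in enumerate(tokenize(doc), start=1):
            col = vocabulary.get(tok)
            if col is not None:  # skip tokens outside the vocabulary
                positions.setdefault(col, []).append(pos)
        row = [0.0] * len(vocabulary)
        for col, pos_list in positions.items():
            if mode == "first":
                row[col] = float(pos_list[0])
            else:
                row[col] = sum(pos_list) / len(pos_list)
        matrix.append(row)
    return matrix


docs = ["the cat sat on the mat", "dog chased the cat"]
vocab = build_vocabulary(docs)
first = offset_features(docs, vocab, mode="first")
mean = offset_features(docs, vocab, mode="mean")
```

To keep the columns aligned with the TF-IDF features, the `vocabulary_` attribute of a fitted TfidfVectorizer could be passed as `vocabulary` instead of building a new one, and the two matrices could then be stacked side by side before training. This only lines up if the vectorizer uses unigrams and its default tokenization. For a large vocabulary a sparse matrix would be a better fit than the dense lists used here, and dividing each position by the document length would give a relative offset rather than an absolute one.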

Answered by Brian Spiering on October 24, 2020
