Text vectorizer that captures feature offsets in the text?

Data Science Asked by R Sorek on October 24, 2020

I’m using sklearn's TfidfVectorizer to extract features from text for text classification.
I believe the information I need tends to be at the beginning of the document, so I would like to somehow capture the offset of each feature per document (either the offset of its first appearance, or the mean offset over all appearances).
Is there a vectorizer that can do that? Or some other method of extracting this information efficiently?

Thank you!

One Answer

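One approach is to create a second matrix that stores this information. Scikit-learn stores text features in a document-by-token matrix. The second matrix would have the same shape, but each cell would hold the position of that token in that document (for example, the index of its first appearance) rather than a count or weight. This matrix could then be used as additional features during modeling.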

This would require writing a custom vectorizer, similar to scikit-learn's CountVectorizer implementation.
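As a rough illustration of the idea above, here is a minimal sketch rather than a full custom vectorizer. It reuses the analyzer and vocabulary of an already fitted CountVectorizer (the same should work with TfidfVectorizer) and builds a sparse document-by-token matrix of first-appearance positions. The helper name `first_offset_matrix` is hypothetical, and the sketch assumes unigram features, since with n-grams the position in the analyzer's output no longer corresponds to a position in the text. Positions are 1-based so that 0 can still mean "token absent".

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer


def first_offset_matrix(docs, vectorizer):
    """Hypothetical helper: sparse matrix of first-appearance positions.

    Rows are documents, columns follow vectorizer.vocabulary_.
    Positions are 1-based token positions; 0 means the token is absent.
    Assumes a fitted vectorizer with unigram features (ngram_range=(1, 1)).
    """
    analyzer = vectorizer.build_analyzer()
    vocab = vectorizer.vocabulary_
    rows, cols, vals = [], [], []
    for i, doc in enumerate(docs):
        first_seen = {}  # column index -> position of first appearance
        for pos, token in enumerate(analyzer(doc), start=1):
            j = vocab.get(token)
            if j is not None and j not in first_seen:
                first_seen[j] = pos
        rows.extend([i] * len(first_seen))
        cols.extend(first_seen.keys())
        vals.extend(first_seen.values())
    return csr_matrix(
        (vals, (rows, cols)),
        shape=(len(docs), len(vocab)),
        dtype=np.float64,
    )


docs = ["the cat sat on the mat", "a dog chased the cat"]
vectorizer = CountVectorizer().fit(docs)
offsets = first_offset_matrix(docs, vectorizer)
```

The resulting matrix could be placed next to the TF-IDF features with `scipy.sparse.hstack`. A mean-offset variant would accumulate a sum and a count per token instead of keeping only the first position. Dividing positions by document length may also be worth trying so that long and short documents are comparable, though that is a modeling choice and not part of the original answer.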

Answered by Brian Spiering on October 24, 2020
