AnswerBun.com

Text vectorizer that captures feature offsets in the text?

Data Science Asked by R Sorek on October 24, 2020

I’m using sklearn's TfidfVectorizer to extract features from text for text classification.
I believe the information I need tends to be at the beginning of the document, so I would like to capture the offset of each feature per document (either the offset of its first appearance, or the mean offset over all appearances).
Is there a vectorizer that can do that, or some other method of extracting this information efficiently?

Thank you!

One Answer

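One approach is to create another matrix that stores this information. Scikit-learn stores text features in a document-by-token matrix, with one row per document and one column per vocabulary token. A second matrix of the same shape could hold, in each cell, the position of that token within that document (for example, the index of its first appearance, or the mean index over all appearances). This matrix could then be used as additional features during modeling.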

It would require writing a custom vectorizer, which would be similar to scikit-learn's CountVectorizer implementation.
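
As a rough illustration of the idea, here is a minimal sketch that reuses a fitted CountVectorizer rather than subclassing it. The helper name `first_offset_matrix` and the toy documents are made up for this example, and it assumes the default unigram settings (with `ngram_range` beyond unigrams, the position would be an index into the analyzer's n-gram output, not a token offset). Positions are 1-based so that 0 can still mean "token not present".

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer


def first_offset_matrix(docs, vectorizer):
    """Hypothetical helper: documents-by-vocabulary matrix of first positions.

    Each cell holds the 1-based position of the token's first appearance
    in the document, or 0 if the token does not appear at all.
    """
    analyzer = vectorizer.build_analyzer()   # same preprocessing + tokenization
    vocab = vectorizer.vocabulary_           # token -> column index (after fit)
    rows, cols, vals = [], [], []
    for i, doc in enumerate(docs):
        seen = set()
        for pos, token in enumerate(analyzer(doc), start=1):
            j = vocab.get(token)
            if j is not None and j not in seen:
                seen.add(j)                  # record only the first appearance
                rows.append(i)
                cols.append(j)
                vals.append(pos)
    return csr_matrix(
        (vals, (rows, cols)),
        shape=(len(docs), len(vocab)),
        dtype=np.float64,
    )


# Toy data, for illustration only
docs = ["the cat sat on the mat", "a dog chased the cat"]
vectorizer = CountVectorizer().fit(docs)
offsets = first_offset_matrix(docs, vectorizer)
```

The result has the same shape and column order as the vectorizer's own output, so it could be placed side by side with the TF-IDF matrix (for example with `scipy.sparse.hstack`) before fitting a classifier. Dividing each position by the document length, or storing the mean position instead of the first, would be small variations on the same loop.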

Answered by Brian Spiering on October 24, 2020
