Data Science Asked by R Sorek on October 24, 2020
I’m using sklearn's TfidfVectorizer to extract features from text for text classification.
I believe the information I need tends to be at the beginning of the document, so I would like to somehow capture the offset of each feature per document (either the offset of the first appearance, or the mean offset over all appearances).
Is there a vectorizer that can do that, or some other method of extracting this information efficiently?
Thank you!
One approach is to build a second matrix that stores this information. Scikit-learn represents text features as a document-by-token matrix; the new matrix would have the same shape, but each cell would hold the position (index) of the token within the document rather than its count or weight. This matrix could then be used as additional features during modeling.
It would require writing a custom vectorizer, similar to scikit-learn's CountVectorizer implementation.
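A minimal sketch of that idea, not a full custom vectorizer: it reuses the analyzer and vocabulary of an already fitted CountVectorizer (TfidfVectorizer exposes the same `build_analyzer()` and `vocabulary_`) and records the position of each token's first appearance. The helper name `first_offset_matrix`, the 1-based positions, and the use of 0 for "token absent" are assumptions made for illustration. It also assumes the default `ngram_range=(1, 1)`, since with n-grams the analyzer output no longer maps one-to-one to token positions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer


def first_offset_matrix(docs, vectorizer):
    """Sparse document-by-token matrix of first-appearance positions.

    Positions are 1-based token indices; 0 means the token is absent
    (an illustrative convention, chosen so the matrix stays sparse).
    """
    analyzer = vectorizer.build_analyzer()  # same tokenization as the vectorizer
    vocab = vectorizer.vocabulary_          # token -> column index
    rows, cols, vals = [], [], []
    for i, doc in enumerate(docs):
        seen = set()
        for pos, token in enumerate(analyzer(doc), start=1):
            j = vocab.get(token)
            # Keep only the first occurrence of each in-vocabulary token.
            if j is not None and j not in seen:
                seen.add(j)
                rows.append(i)
                cols.append(j)
                vals.append(pos)
    return csr_matrix(
        (vals, (rows, cols)), shape=(len(docs), len(vocab)), dtype=np.float64
    )


docs = ["the cat sat on the mat", "a dog chased the cat"]
vec = CountVectorizer().fit(docs)
offsets = first_offset_matrix(docs, vec)
```

The resulting matrix has the same columns as the vectorizer output, so it can be stacked next to the TF-IDF features with `scipy.sparse.hstack`. For the mean offset instead of the first one, the same loop could accumulate a sum and a count per token and divide at the end. Dividing positions by document length may help when documents vary a lot in size.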
Answered by Brian Spiering on October 24, 2020