TransWikia.com

spaCy tokenizer that strips HTML/XML while keeping the positions

Data Science Asked on July 26, 2021

I’m new to spaCy and, in fact, new to data science. I would like to process some XML files for NER and then mark the results in the original XML. I would like to know how to tokenize the XML for NER. I think I should filter out the XML tags/code and then feed in the remaining text while retaining positions, so that I can map the positions of the NER results back to the original XML. This would help me identify and tag the exact locations in the XML.

Any ideas and guidance on this would be most appreciated. Thanks

One Answer

Generally, XML is first parsed. Then, the contents can be analyzed with something like spaCy.

xml.etree.ElementTree, which ships with the standard library, is a common way to parse XML in Python.
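As a rough sketch of the parse-then-analyze idea (not a definitive implementation): walk the parsed tree, run NER on each text chunk, and remember which element and which slot (`text` or `tail`) each hit came from. The helper names `iter_text_chunks`, `find_entities` and `toy_ner` are made up for illustration, and `toy_ner` is only a stand-in so the example runs without a model. Note that ElementTree elements do not carry character offsets into the original file, so the positions here are relative to each chunk.

```python
import re
import xml.etree.ElementTree as ET


def iter_text_chunks(root):
    """Yield (element, slot, text) for every non-blank text chunk.

    ElementTree stores character data in two places: elem.text
    (before the first child) and elem.tail (after the closing tag).
    """
    for elem in root.iter():
        if elem.text and elem.text.strip():
            yield elem, "text", elem.text
        if elem.tail and elem.tail.strip():
            yield elem, "tail", elem.tail


def find_entities(root, ner):
    """Run `ner` on each chunk; return (element, slot, start, end, label)."""
    hits = []
    for elem, slot, text in iter_text_chunks(root):
        for start, end, label in ner(text):
            hits.append((elem, slot, start, end, label))
    return hits


def toy_ner(text):
    """Hypothetical stand-in for a real model: tags the word 'London'."""
    return [(m.start(), m.end(), "GPE") for m in re.finditer(r"London", text)]


# With spaCy installed, a real callable could look like this
# (assumes the en_core_web_sm model has been downloaded):
#
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   def spacy_ner(text):
#       return [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]

xml_source = "<doc><p>Alice moved to <b>London</b> in 2019.</p></doc>"
root = ET.fromstring(xml_source)
chunks = [(elem.tag, slot, text) for elem, slot, text in iter_text_chunks(root)]
hits = find_entities(root, toy_ner)

for elem, slot, start, end, label in hits:
    print(elem.tag, slot, start, end, label)
```

Because each hit points at an element and a slot, you can wrap or annotate the matched span in the tree and write the XML back out. One limitation of this per-chunk approach: an entity that straddles a tag boundary will be missed, since the model never sees the two halves together.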

Answered by Brian Spiering on July 26, 2021
