spaCy tokenizer that strips HTML/XML while keeping the positions

Question

I'm new to spaCy, and in fact new to data science. I would like to run NER on some XML files and then mark the results in the original XML. How should I tokenize the XML for NER? I think I should filter out the XML tags/code and feed in the remaining text while retaining positions, so that I can map the positions of the NER results back to the original XML. This would help me identify and tag the exact locations in the XML.
Any ideas and guidance on this would be most appreciated. Thanks.

Brian Spiering · Answer

Generally, the XML is parsed first, and the extracted text is then analyzed with something like spaCy.
xml.etree.ElementTree, from the standard library, is a common way to parse XML in Python.
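
If you need the positions back in the original XML, one option is to strip the tags yourself and keep an offset table alongside the plain text. The sketch below is only an illustration of that idea, not a spaCy or ElementTree feature: it swaps in a naive regex instead of a real parser, and the names `strip_tags_with_offsets` and `to_original_span` are made up for this example. It assumes simple, well-formed markup with no comments, CDATA sections, or `>` inside attribute values, and it leaves entities such as `&amp;` untouched.

```python
import re

# Naive tag pattern (an assumption): fine for simple markup, not for
# comments, CDATA sections, or ">" inside attribute values.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_with_offsets(xml):
    """Return (text, offsets), where offsets[i] is the index in `xml`
    of the character text[i]."""
    chars = []
    offsets = []
    pos = 0
    for match in TAG_RE.finditer(xml):
        # Copy the characters between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(xml[i])
            offsets.append(i)
        pos = match.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(xml)):
        chars.append(xml[i])
        offsets.append(i)
    return "".join(chars), offsets


def to_original_span(offsets, start_char, end_char):
    """Map a non-empty [start_char, end_char) span in the stripped text
    to a [start, end) span in the original XML string."""
    return offsets[start_char], offsets[end_char - 1] + 1


xml = "<doc><p>Ada Lovelace met <b>Charles Babbage</b> in London.</p></doc>"
text, offsets = strip_tags_with_offsets(xml)

# Stand-in for an NER result: a character span in the stripped text.
start = text.index("Charles Babbage")
end = start + len("Charles Babbage")
orig_start, orig_end = to_original_span(offsets, start, end)
print(xml[orig_start:orig_end])
```

With spaCy, you would pass `text` to your pipeline and feed each entity's `ent.start_char` and `ent.end_char` into `to_original_span`. Note that if an entity straddles a tag, the mapped span in the original XML will include that tag, so you would need to decide how to handle that case when inserting your markup.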
