TransWikia.com

How do I filter an XML file to find all of one type of tag with any invalid values?

Super User Asked by GC_ on December 24, 2020

I was messing around with Notepad++, but can't seem to find an easy way to do this. I think grep might work, but I am not totally sure how.

I have a file that contains certain tags, and I want to find all of the tags that have incorrect values. For example:

This is what most of them are.

<tag attr="1">Correct</tag>

However, I want to find all the ones with anything else in them.

<tag attr="1">Wrong</tag>
<tag attr="1">Incorrect</tag>
<tag attr="1">Gibberish</tag>

… etc, etc …

There are thousands of them, but I am just looking for the bad ones. I don’t want to look at each one manually. Also, more than one tag can be on the same line.

GC

3 Answers

It's better to use an XML parser, but if you want to use Notepad++, this does the job:

  • Ctrl+F
  • Find what: <tag[^>]*>(?:(?!Correct|</tag>).)*</tag>
  • CHECK Match case
  • CHECK Wrap around
  • CHECK Regular expression
  • UNCHECK . matches newline
  • Find All in Current Document

Explanation:

<tag[^>]*>          # open tag
            # Tempered Greedy Token
(?:                 # non capture group
    (?!                 # negative lookahead, make sure what follows is not:
        Correct             # literally Correct
      |                   # OR
        </tag>              # end tag
    )                   # end lookahead
    .                   # any character
)*                  # end group, may appear 0 or more times
</tag>              # end tag
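
For anyone who would rather script the search than use the Notepad++ dialog, the same pattern can be tried with Python's `re` module. This is only a minimal sketch: the function name `find_bad_tags` and the sample text are made up for illustration. By default `.` does not match a newline in Python, which mirrors leaving ". matches newline" unchecked.

```python
import re

# Same pattern as above: a <tag ...>...</tag> element whose content
# does not contain the literal text "Correct" (case-sensitive).
BAD_TAG = re.compile(r"<tag[^>]*>(?:(?!Correct|</tag>).)*</tag>")


def find_bad_tags(text):
    """Return every <tag>...</tag> match whose content lacks 'Correct'."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return BAD_TAG.findall(text)


# Hypothetical sample with more than one tag on the first line.
sample = (
    '<tag attr="1">Correct</tag> <tag attr="1">Wrong</tag>\n'
    '<tag attr="1">Incorrect</tag>\n'
    '<tag attr="1">Gibberish</tag>\n'
)

for match in find_bad_tags(sample):
    print(match)
```

Note that the lookahead rejects any content that contains "Correct" anywhere, so a value such as "Correct-ish" would not be reported as bad.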

Correct answer by Toto on December 24, 2020

Find an XML editor that allows XPath searching (I use oXygen), and the query is then //tag[not(.='Correct')].

If you're doing anything with XML, you need to master XPath: working with regular expressions to process XML is inefficient, clumsy, and ultimately it gives the wrong answer - there will always be some way of writing the XML that defeats your regex. For example people doing this with a regex often forget that attributes can be delimited by single quotes rather than double quotes, or that a newline can appear before the ">" in a start tag.
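
If no XPath-capable editor is at hand, the same idea can be sketched with a real XML parser from Python's standard library. This is an illustration under assumptions, not the method described above: it assumes the file is well-formed with a single root element (the `<root>` wrapper and the function name `find_bad_elements` are hypothetical), and because `xml.etree.ElementTree` supports only a subset of XPath, the filter is written as a plain loop rather than as `//tag[not(.='Correct')]`.

```python
import xml.etree.ElementTree as ET


def find_bad_elements(xml_text, tag="tag", expected="Correct"):
    """Return the text of every <tag> element whose content is not `expected`."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree in document order, so nesting depth
    # and line layout do not matter.
    return [el.text for el in root.iter(tag) if el.text != expected]


# Hypothetical sample: single-quoted attribute and a newline before ">",
# the two cases mentioned above that tend to defeat a regex.
sample = """<root>
  <tag attr="1">Correct</tag><tag attr='1'>Wrong</tag>
  <tag attr="1"
  >Gibberish</tag>
</root>"""

print(find_bad_elements(sample))
```

Here `el.text` is only the text before an element's first child, so this comparison matches the XPath string value only for tags that contain plain text.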

Answered by Michael Kay on December 24, 2020

Use Ctrl+H (find/replace) with Regular expression turned on. A dot is a single-character wildcard, and .* matches everything. If you want to work with line breaks, \r\n or \n will be your friend too. What defines a correct tag's contents? Is it always one word, or a fixed length?

For example, <tag attr="1">....g</tag> is a regex for any tag with attr="1" whose contents are four characters followed by a g.

Second, <tag attr="1">\r\n....g</tag> is a regex for the same, but with a line break after the opening tag and before the contents of the tag. More details would help zero in on the exact regex for Notepad++, so send more details if needed.

You can also use (<tag attr="1">)(....g)(</tag>) to capture the three sections. $1, $2, and $3 address the captured parts, so $1$2$3 pastes the match back unchanged.
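
The capture-group idea can be tried outside Notepad++ as well. The sketch below uses Python's `re` module and assumes the three groups wrap the opening tag, the five-character content, and the closing tag; the exact pattern and the sample string are illustrative guesses, not taken from the question's real data. Python writes backreferences as `\1` where Notepad++ uses `$1`.

```python
import re

# Three groups: opening tag, four characters followed by "g", closing tag.
PATTERN = re.compile(r'(<tag attr="1">)(....g)(</tag>)')

# Hypothetical sample: "Wrong" fits ....g, "Correct" does not.
sample = '<tag attr="1">Wrong</tag> <tag attr="1">Correct</tag>'

# \1\2\3 pastes every match back unchanged.
unchanged = PATTERN.sub(r"\1\2\3", sample)

# Keeping groups 1 and 3 but replacing the middle rewrites only the content.
fixed = PATTERN.sub(r"\1Correct\3", sample)

print(unchanged == sample)
print(fixed)
```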

Answered by Alex Roberts on December 24, 2020
