How do I filter an XML file to find all of one type of tag with any invalid values?

Question

I was messing around with Notepad++, but can't seem to find an easy way to do this. I think grep might work, but I am not totally sure how.

I have a file that has certain tags, and I want to find all of the tags that have incorrect values. For example:

This is what most of them are.

<tag attr="1">Correct</tag>

However, I want to find all the ones with anything else in them.

<tag attr="1">Wrong</tag>
<tag attr="1">Incorrect</tag>
<tag attr="1">Gibberish</tag>

… etc, etc …

There are thousands of them, but I am just looking for the bad ones. I don’t want to look at each one manually. Also, more than one tag can be on the same line.

GC

find grep notepad++ windows xml

Toto · Accepted Answer

It's better to use an XML parser, but if you want to use Notepad++, this does the job:

Ctrl+F
Find what: <tag[^>]*>(?:(?!Correct|</tag>).)*</tag>
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Find All in Current Document

Explanation:

<tag[^>]*>          # open tag
                    # tempered greedy token:
(?:                 # non-capturing group
    (?!                 # negative lookahead: make sure what follows is not
        Correct             # literally Correct
      |                   # OR
        </tag>              # end tag
    )                   # end lookahead
    .                   # any character
)*                  # end group, may appear 0 or more times
</tag>              # end tag
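
If you would rather script the search than click through Notepad++, the same pattern can be tried in Python. This is only a minimal sketch: the function name find_bad_tags and the sample text are made up for illustration, and it shares the limitation of the regex above (a tag whose content merely contains "Correct" is skipped too).

```python
import re

# Same pattern as the Notepad++ search. re.DOTALL is left off so "." does not
# cross line breaks, mirroring the unchecked ". matches newline" option, and
# matching is case-sensitive by default, mirroring "Match case".
BAD_TAG = re.compile(r"<tag[^>]*>(?:(?!Correct|</tag>).)*</tag>")


def find_bad_tags(text):
    """Return every <tag>...</tag> whose content does not contain 'Correct'."""
    return BAD_TAG.findall(text)


# Hypothetical sample with more than one tag per line.
sample = (
    '<tag attr="1">Correct</tag><tag attr="1">Wrong</tag>\n'
    '<tag attr="1">Incorrect</tag> <tag attr="1">Correct</tag>\n'
)

for match in find_bad_tags(sample):
    print(match)
```

To run it on a real file, read the file into a string and pass that to find_bad_tags instead of the sample.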

Michael Kay · Answer

Find an XML editor that allows XPath searching (I use oXygen), and the query is then //tag[not(.='Correct')].
If you're doing anything with XML, you need to master XPath: working with regular expressions to process XML is inefficient, clumsy, and ultimately it gives the wrong answer. There will always be some way of writing the XML that defeats your regex. For example, people doing this with a regex often forget that attributes can be delimited by single quotes rather than double quotes, or that a newline can appear before the ">" in a start tag.
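
For anyone without an XPath-capable editor, roughly the same check can be sketched with a real XML parser from Python's standard library. Assumptions here: the <root> wrapper, the sample data, and the function name find_bad_elements are invented for the example, and because ElementTree only implements a subset of XPath (no not() function), the filtering is done in Python rather than with the query above.

```python
import xml.etree.ElementTree as ET


def find_bad_elements(xml_text, tag_name="tag", expected="Correct"):
    """Return the text of every <tag_name> element whose text is not `expected`."""
    root = ET.fromstring(xml_text)
    bad = []
    for elem in root.iter(tag_name):
        # Join all nested text, similar to the XPath string value of the node.
        value = "".join(elem.itertext())
        if value != expected:
            bad.append(value)
    return bad


# Hypothetical sample: single-quoted attribute and a newline before ">",
# the two cases that tend to trip up regex-based searches.
sample = """<root>
  <tag attr="1">Correct</tag><tag attr='1'>Wrong</tag>
  <tag attr="1"
  >Gibberish</tag>
</root>"""

print(find_bad_elements(sample))
```

Because the parser handles quoting and whitespace itself, neither variation needs special treatment in the code.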

Alex Roberts · Answer

Use Ctrl+H (find/replace) with regular expressions turned on. A dot matches any single character; .* matches everything. If you want to work with line breaks, \r\n or \n will be your friend too. What defines a correct tag's contents? Is it always one word, or a fixed length?
For example, <tag attr="1">....g</tag> is a regex for literally any tag with attr 1 whose contents are 4 characters followed by a g.
Second, <tag attr="1">\r\n....g</tag> is a regex for the same, but with a new line after the opening tag and before the contents of the tag. Some more details would help zero in on the exact regex for Notepad++, so send more details if needed.
You can also do (<tag attr="1">)(....g)(</tag>) to parse out the three sections. $1 etc. is how to address the parsed parts; $1$2$3 is a literal paste-back.
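
The capture-group idea can be tried outside Notepad++ as well. A rough Python sketch, assuming a hypothetical three-group pattern (opening tag, four characters plus a g, closing tag); note that Python writes backreferences as \1 where Notepad++ uses $1.

```python
import re

# Hypothetical pattern: group 1 is the opening tag, group 2 is a
# five-character content ending in "g", group 3 is the closing tag.
PATTERN = re.compile(r'(<tag attr="1">)(....g)(</tag>)')

line = '<tag attr="1">Wrong</tag><tag attr="1">Correct</tag>'

# \1\2\3 pastes every match back unchanged, like $1$2$3 in Notepad++.
unchanged = PATTERN.sub(r"\1\2\3", line)

# Swapping only the middle group rewrites the content and keeps the tags.
fixed = PATTERN.sub(r"\1Correct\3", line)

print(unchanged)
print(fixed)
```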
