TransWikia.com

Extracting text from a capture group while using an arrayformula to consider multiple criteria

Web Applications Asked on November 3, 2021

I have a page that contains a list of fairly varied regular expressions in the format "some text (thing I want) but only if (thing is matched here)". There are 29 of them so far, and I’d prefer not to put them all into a single formula, though that is my fallback. A sample of what I’m doing is here.

I have tried several techniques. For example, I used textjoin() to concatenate all my conditions into one pattern, and I can correctly get a match, that is, the true/false answer that the input is valid. However, I cannot then perform the corresponding extract, because I don’t know which row I matched on. I thought this would be the best way to go, but formulas like VLOOKUP can’t be used with a regular expression, so I’m uncertain how to obtain that data.

The closest I’ve gotten is the formula below, which returns the thing I want but also the other groups.

=textjoin("",true,arrayformula(if(iserror(REGEXEXTRACT(E2,'Criteria'!B2:B)),"",choose(1,REGEXEXTRACT(E2,'Criteria'!B2:B)))))

I’m using textjoin so that the result isn’t overwritten with "" by the non-matches on other rows. My expectation was that choose would restrict the textjoin to only the first element, but that is not the output I’m seeing.

Thoughts on how to extract only the pattern match for "thing I want"?

2 Answers

The problem with the formula is the usual one of not being careful about data types: CHOOSE() expects its choices as separate arguments, but the return value of REGEXEXTRACT() is a single object that contains a row of cells, one per capture group. To pick an individual cell from a row, one must use INDEX().

However, the result of an ARRAYFORMULA of REGEXEXTRACTs is multiple rows, and while INDEX() can pick a row and column, it is not obvious which one is correct. This is the same reason we can't use LOOKUP(): we don't know which regular expression matched.

To flatten the structure, we can use TEXTJOIN() with a delimiter and, importantly, with the ignore-empty flag set to true. The empty cells generated where REGEXEXTRACT did not match are dropped, and SPLIT() on the same delimiter then leaves us with a row that contains only the matching cells.

Since we are back to a single row, using INDEX() is practical again.

=index(split(textjoin("|",true,arrayformula(iferror(REGEXEXTRACT(E3,Criteria!$A$2:$A),""))),"|"),0,1)
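For readers who find the nesting hard to follow, here is a rough Python sketch of the same logic as the formula above. It is only a model, not how Sheets evaluates anything: `regex_extract` and `first_capture` are hypothetical helpers, the sample patterns are made up, and Python's `re` module is not the RE2 engine that Sheets uses.

```python
import re

def regex_extract(text, pattern):
    """Loose stand-in for REGEXEXTRACT: a row of capture groups,
    or the whole match if the pattern has no groups; None on no match."""
    m = re.search(pattern, text)
    if m is None:
        return None  # plays the role of the error that IFERROR turns into ""
    groups = [g or "" for g in m.groups()]
    return groups if groups else [m.group(0)]

def first_capture(text, patterns):
    """Try every pattern, drop the non-matches, flatten, take the first cell."""
    cells = []
    for pattern in patterns:
        row = regex_extract(text, pattern)
        if row is not None:
            # Like TEXTJOIN with ignore-empty set: skip empty cells.
            cells.extend(cell for cell in row if cell)
    # Like INDEX(..., 1) on the flattened row.
    return cells[0] if cells else ""

# Made-up criteria, standing in for the Criteria sheet.
patterns = [
    r"order (\d+) shipped to (\w+)",
    r"invoice (\w+) paid",
]
print(first_capture("order 42 shipped to Berlin", patterns))
```

As in the formula, the result is the first non-empty cell across all matching patterns, so if more than one pattern matches, the order of the criteria decides which capture is returned.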

The final mess incorporates the suggestion from @marikamitsos to use iferror() instead of if(iserror()).

I think this pattern may be useful to other searchers, and it is a reminder to myself to be more mindful of return types! :)

Answered by Stephen on November 3, 2021

You would need to alter the regular expression in your sample:

from: ((?:[[:alpha:]]+s?)+)
to: (?:[[:alpha:]]+s?)+

Following that, your formula will work just fine.

=textjoin("",true,arrayformula(if(iserror(REGEXEXTRACT(B2,Criteria!$A$2:$A)),"",choose(1,REGEXEXTRACT(B2,Criteria!$A$2:$A)))))
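To see why dropping the outer parentheses changes the output, here is a small Python sketch. It assumes a made-up pattern and input that only loosely resemble the sample, uses `[A-Za-z]` because Python's `re` does not support `[[:alpha:]]`, and Python's engine is not the RE2 engine behind REGEXEXTRACT, so treat it as an illustration of capturing versus non-capturing groups only.

```python
import re

text = "42 red apples"  # hypothetical input

# Outer parentheses capture the whole run of words as a second group.
capturing = r"(\d+) ((?:[A-Za-z]+\s?)+)"
# With (?: ... ) only, the words must still match but are not returned.
non_capturing = r"(\d+) (?:[A-Za-z]+\s?)+"

with_extra_group = re.search(capturing, text).groups()
wanted_only = re.search(non_capturing, text).groups()

print(with_extra_group)
print(wanted_only)
```

The second pattern still requires the words to be present, which keeps the "but only if" condition, while returning just the one group that is wanted.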


You could also use [A-Za-z] instead of [:alpha:] as shown in cell C2.


However, if you also change your formula and use the ArrayFormula function in a different way, you can use just one formula for all rows.

=ArrayFormula(IFERROR(REGEXEXTRACT(B2:B,Criteria!$A$2:$A)))


Answered by marikamitsos on November 3, 2021
