TransWikia.com

regex - how to capture pattern without different pattern before it?

Stack Overflow Asked on November 29, 2021

I’m trying to parse out prices but ignore two patterns that are also prices. The first exclusion is the total price at the end, which I am ignoring with a lookahead. The second exclusion is a price preceded by a variation of the letter Q, for example Q10.00 or Q AWSMSN11.32. However, I do want to include a price preceded by a three-letter alpha code that happens to end in Q, such as YMQ234.03.

I’ve added a negative lookbehind but can’t seem to get what I want.

This is the pattern I’ve tried: (?<![Q\d]) ?M?(\d+\.\d{2})(?=.*\d+\.\d{2}END)

test strings

ABC WS YMQ234.03WS TOY234.03USD468.06END
FUR BB LAB Q10.00 199.00USD209.00END
YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END
PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END


Expected output

+---------------------------------------------------------------------------+---------+---------+
| ABC WS YMQ234.03WS TOY234.03USD468.06END                                  | 234.03  | 234.03  |
| FUR BB LAB Q10.00 199.00USD209.00END                                      | 199.00  |         |
| YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END | 2503.08 | 2503.08 |
| PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END                 | 342.41  | 282.24  |
+---------------------------------------------------------------------------+---------+---------+
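To see where the attempted pattern falls short, it can help to run it over a few of the test strings. The following is only a minimal sketch: it assumes the pattern is exactly the one quoted in the question, and the names attempt, strings and results are made up for illustration.

```python
import re

# The attempted pattern from the question, with its escapes written out.
attempt = re.compile(r"(?<![Q\d]) ?M?(\d+\.\d{2})(?=.*\d+\.\d{2}END)")

strings = [
    "ABC WS YMQ234.03WS TOY234.03USD468.06END",
    "FUR BB LAB Q10.00 199.00USD209.00END",
    "YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END",
]

# With a single capture group, findall returns that group's text per match.
results = [attempt.findall(s) for s in strings]
for s, found in zip(strings, results):
    print(s, "=>", found)
```

The lookbehind only inspects the single character immediately before the digits. That is why the price in YMQ234.03 is rejected (the first string yields only one 234.03), while 11.32 in Q AWSMSN11.32 is accepted, because the character before it is N rather than Q.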

4 Answers

You might also match what you don't want, and capture what you do want.

Match optional whitespace and uppercase chars that contain a Q, followed by the decimal value.

Make an exception for a Q that ends a three-letter code: assert that the value is not directly preceded by two uppercase chars A-Z followed by Q, so that this unwanted match is not made.

After the alternation, capture the decimal value in group 1, asserting that it is not followed by END.

\b[A-Z ]*Q[A-Z ]*(?<![A-Z][A-Z]Q)\d+\.\d+|(\d+\.\d{2})(?!END)

Explanation

  • \b[A-Z ]*Q[A-Z ]* Word boundary, match a Q between optional spaces and uppercase chars
  • (?<![A-Z][A-Z]Q) Negative lookbehind, assert not 2 uppercase chars A-Z followed by Q directly to the left
  • \d+\.\d+ Match a decimal value
  • | Or
  • ( Capture group 1
    • \d+\.\d{2} Match 1+ digits followed by a dot and 2 digits
  • ) Close group 1
  • (?!END) Negative lookahead, assert what is directly to the right is not END


For example

import re

regex = r"\b[A-Z ]*Q[A-Z ]*(?<![A-Z][A-Z]Q)\d+\.\d+|(\d+\.\d{2})(?!END)"
strings = [
    "ABC WS YMQ234.03WS TOY234.03USD468.06END",
    "FUR BB LAB Q10.00 199.00USD209.00END",
    "YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END",
    "PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END"
]

for s in strings:
    # Keep only matches where group 1 took part; the first alternative leaves it as None
    print('{}: {}'.format(s, [x.group(1) for x in re.finditer(regex, s) if x.group(1)]))

Output

ABC WS YMQ234.03WS TOY234.03USD468.06END: ['234.03', '234.03']
FUR BB LAB Q10.00 199.00USD209.00END: ['199.00']
YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END: ['2503.08', '2503.08']
PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END: ['342.41', '282.24']

Answered by The fourth bird on November 29, 2021

You may use a simple well-known trick when you need to discard some matches: use an optional capturing group that will only match when a failure is expected:

(\bQ\s?[A-Z]*)?(?<!\d)(\d+\.\d{2})(?=.*\d\.\d{2}END)
|_____________|

Whenever Group 1 of the match data object is not empty, the match should be dropped.

Regex details

  • (\bQ\s?[A-Z]*)? - an optional capturing group #1 that captures
    • \bQ - a word boundary followed with Q
    • \s? - one or zero whitespaces
    • [A-Z]* - any 0 or more ASCII uppercase letters
  • (?<!\d) - no digit immediately on the left is allowed
  • (\d+\.\d{2}) - 1+ digits, . and then any two digits
  • (?=.*\d\.\d{2}END) - any 0 or more chars other than line break chars, as many as possible, followed with a digit, ., two digits and END must appear to the right of the current location.

See the Python implementation with re:

import re
strings = ['ABC WS YMQ234.03WS TOY234.03USD468.06END','FUR BB LAB Q10.00 199.00USD209.00END','YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END','PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END']
rx = r'(\bQ\s?[A-Z]*)?(?<!\d)(\d+\.\d{2})(?=.*\d\.\d{2}END)'
for s in strings:
    matches = [x.group(2) for x in re.finditer(rx, s) if not x.group(1)] # note the if condition that drops unwelcome matches
    print(s, matches, sep=" => ")

Output:

ABC WS YMQ234.03WS TOY234.03USD468.06END => ['234.03', '234.03']
FUR BB LAB Q10.00 199.00USD209.00END => ['199.00']
YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END => ['2503.08', '2503.08']
PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END => ['342.41', '282.24']

Answered by Wiktor Stribiżew on November 29, 2021

You could use the regex module (a third-party package on PyPI) instead of re with the pattern:

Q[A-Z ]*(?<!\b[A-Z]{2}Q)[\d.]+(*SKIP)(*F)|\d+(?:\.\d+)(?!\d*END$)

In Python this could look like:

import regex
arr = ['ABC WS YMQ234.03WS TOY234.03USD468.06END', 'FUR BB LAB Q10.00 199.00USD209.00END', 'YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END', 'PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END']
res = [regex.findall(r'Q[A-Z ]*(?<!\b[A-Z]{2}Q)[\d.]+(*SKIP)(*F)|\d+(?:\.\d+)(?!\d*END$)',x) for x in arr]
print(res)

Prints:

[['234.03', '234.03'], ['199.00'], ['2503.08', '2503.08'], ['342.41', '282.24']]

Answered by JvdV on November 29, 2021

By matching the following regular expression the values of interest will be saved to capture group 1.

r'(?=[^Q\d]*(?=\d))(?:(?<!Q)|(?<=[A-Z]{2}Q)|\D*\d+\.\d{2})[^Q\d]*([1-9]\d*\.\d{2})(?!END\b)'


Python's regex engine performs the following operations.

(?=               : begin positive lookahead
  [^Q\d]*         : match 0+ chars other than 'Q' and digits
  (?=\d)          : positive lookahead asserts next char is a digit
)                 : end positive lookahead
(?:               : begin non-capture group
  (?<!Q)          : negative lookbehind asserts previous char is
                    not 'Q'
|                 : or
  (?<=            : begin positive lookbehind
    [A-Z]{2}Q     : match two letters, 'Q'
  )               : end positive lookbehind
|                 : or
  \D*\d+\.\d{2}   : match 0+ non-digits, 1+ digits, '.', 2 digits
)                 : end non-capture group
[^Q\d]*           : match 0+ chars other than 'Q' or digits
([1-9]\d*\.\d{2}) : match digit other than zero, 0+ digits, '.',
                    2 digits; save to capture group 1
(?!END\b)         : negative lookahead asserts current match is
                    not followed by 'END'

The positive lookahead (?=[^Q\d]*(?=\d)) succeeds at the current location, without moving the regex engine's internal string pointer, if 'Q' does not appear between that location and the next digit. Otherwise the match attempt fails there, and the engine advances the starting position until it is just after the last 'Q' that precedes that digit.
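As a rough check, the expression can be run with Python's built-in re module over the four test strings from the question. This is only a sketch: the names pattern, strings and results are chosen here for illustration, and the pattern is split over two string literals purely for readability.

```python
import re

# The expression from this answer, with its escapes written out.
pattern = re.compile(
    r"(?=[^Q\d]*(?=\d))(?:(?<!Q)|(?<=[A-Z]{2}Q)|\D*\d+\.\d{2})"
    r"[^Q\d]*([1-9]\d*\.\d{2})(?!END\b)"
)

strings = [
    "ABC WS YMQ234.03WS TOY234.03USD468.06END",
    "FUR BB LAB Q10.00 199.00USD209.00END",
    "YAS DG TYY Q AWSMSN11.32 2503.08LD VET Q JKLOLE11.32 2503.08USD5028.80END",
    "PPP VP LAP Q10.00 M342.41EE SFD Q10.00 282.24USD644.65END",
]

# findall returns the contents of capture group 1 for every match.
results = [pattern.findall(s) for s in strings]
for s, found in zip(strings, results):
    print(s, "=>", found)
```

For these four strings the captured values should agree with the expected output in the question. Note that a price directly after a Q is not skipped on its own: the third alternative consumes it, and the capture group then picks up the price that follows.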

Answered by Cary Swoveland on November 29, 2021
