TransWikia.com

Character-based transitions (part of a lexer)

Code Review Asked by CAD97 on October 27, 2021

As part of my prep for the Code Review (which looks like it will be Write your own programming language), I’ve been working on an LL(1) lexer generator in Rust.

Yes, lexer generators exist already, and full-on lexer/parser generators as well. My time might’ve been better spent building a Rust target for ANTLR (as that’s the tool I know best already).

Right now though, I’m just looking at my model for transitions in the finite automaton that gets built up to turn the regex-ish into an actual lexer. Because the lexer generator I’m building is Unicode-first, a simple table of input symbols would be 0x110000 characters wide. That’s pretty much an instant veto. Instead, I’m using a enum that allows globbing of sets of characters into one transition:

#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, PartialOrd, Ord)]
pub enum Transition {
    Empty,
    Character(char),
    Range(char, char),
}

In the future, it will also support specifying blocks of characters by Unicode properties.

For the NFA model, just using the character and/or ranges specified by the user is fine; the ambiguity is inbuilt into the model anyway. But to make a lexer, I have to disambiguate different transitions, such that only one can be chosen at any time.

As a simple example, take the simple grammar:

If: "if" ;
Ident: [[:alpha:]][[:alnum:]]* ;

The NFA is trivial:

digraph nfa { rankdir=LR; node [shape = doublecircle]; If Ident; node [shape = circle]; Entry -> 1 [ label = "'i'" ]; 1 -> If [ label = "'f'" ]; Entry -> Ident [ label = "[:alpha:]" ]; Ident -> Ident [ label = "[:alnum:]" ]; }

But the DFA requires allowing for the 'i' and 'f' transitions being subsets of the transitions that the rule Ident follows.

digraph dfa { rankdir=LR; node [shape = doublecircle]; If Ident; node [shape = circle]; Entry -> 1 [ label = "'i'" ]; 1 -> If [ label = "'f'" ]; 1 -> Ident [ label = "[:alnum:] - 'f'" ]; If -> Ident [ label = "[:alnum:]" ]; Entry -> Ident [ label = "[:alpha:] - 'i'" ]; Ident -> Ident [ label = "[:alnum:]" ]; }

To that end, and the main point of asking this review, I have a function Transition::disambiguate which takes two transitions and returns a set of transitions that cover all combinations of those two transitions. So disambiguate( 'a'..='e', 'c' ) returns the set { 'a'..='b', 'c', 'd'..='e' }.

Enough of the theory, let’s see the code

use std::{char, fmt};
use std::cmp::{max, min};

#[derive(Debug, Clone, Copy, Hash, PartialEq, Eq, PartialOrd, Ord)]
pub enum Transition {
    Empty,
    Character(char),
    Range(char, char),
}

impl Transition {
    pub fn overlaps(&self, other: &Transition) -> bool {
        match (*self, *other) {
            (Transition::Empty, Transition::Empty) => true,
            (Transition::Empty, _) | (_, Transition::Empty) => false,
            (Transition::Character(x), Transition::Character(y)) => x == y,
            (Transition::Range(lo, hi), Transition::Character(c)) |
            (Transition::Character(c), Transition::Range(lo, hi)) => lo <= c && c <= hi,
            (Transition::Range(lo1, hi1), Transition::Range(lo2, hi2)) => hi1 >= lo2 || lo1 <= hi2,
        }
    }
}

impl From<(char, char)> for Transition {
    fn from(pair: (char, char)) -> Transition {
        let (lo, hi) = pair;
        if lo == hi {
            Transition::Character(hi)
        } else {
            Transition::Range(lo, hi)
        }
    }
}

impl From<char> for Transition {
    fn from(char: char) -> Transition {
        Transition::Character(char)
    }
}

// TODO: move
fn before(c: char) -> char {
    if c as u32 == 0 {
        panic!("Tried to get character before '\0'")
    }
    let mut point = c as u32 - 1;
    if point == 0xDFFF {
        point = 0xD800 - 1;
    }
    char::from_u32(point).expect("Unexpected surrogate codepoint")
}

// TODO: move
fn after(c: char) -> char {
    if c == char::MAX {
        panic!("Tried to get character after char::MAX")
    }
    let mut point = c as u32 + 1;
    if point == 0xD800 {
        point = 0xDFFF + 1;
    }
    char::from_u32(point).expect("Unexpected surrogate codepoint")
}

impl Transition {
    pub fn disambiguate(lhs: Transition, rhs: Transition) -> Vec<Transition> {
        if !Transition::overlaps(&lhs, &rhs) {
            return vec![lhs, rhs];
        }

        if lhs == rhs {
            return vec![lhs];
        }

        match (lhs, rhs) {
            (Transition::Character(_), Transition::Character(_)) |
            (Transition::Empty, _) |
            (_, Transition::Empty) => unreachable!(),
            (Transition::Range(lo, hi), Transition::Character(c)) |
            (Transition::Character(c), Transition::Range(lo, hi)) => {
                assert!(lo <= c && c <= hi);
                if lo == c {
                    vec![Transition::from(c), Transition::from((after(c), hi))]
                } else if c == hi {
                    vec![Transition::from((lo, before(c))), Transition::from(c)]
                } else {
                    assert!(lo < c && c < hi);
                    vec![
                        Transition::from((lo, before(c))),
                        Transition::from(c),
                        Transition::from((after(c), hi)),
                    ]
                }
            },
            (Transition::Range(lo1, hi1), Transition::Range(lo2, hi2)) => {
                assert!((lo2 <= lo1 && lo1 <= hi2) || (lo1 <= lo2 && lo2 <= hi1));
                let points = [
                    min(lo1, lo2),
                    max(lo1, lo2),
                    min(hi1, hi2),
                    max(hi1, hi2),
                ];
                vec![
                    Transition::from((points[0], before(points[1]))),
                    Transition::from((points[1], points[2])),
                    Transition::from((after(points[2]), points[3])),
                ]
            },
        }
    }
}

impl fmt::Display for Transition {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match *self {
            Transition::Empty => write!(f, "ε"),
            Transition::Character(c) => write!(f, "{:?}", c),
            Transition::Range(low, high) => write!(f, "{:?}..={:?}", low, high),
        }
    }
}

#[cfg(test)]
#[cfg_attr(rustfmt, rustfmt_skip)]
mod test {
    use super::Transition;
    #[test]
    fn disambiguate_char_range() {
        assert_eq!(
            Transition::disambiguate(
                Transition::Character('c'),
                Transition::Range('a', 'e'),
            ),
            vec![
                Transition::Range('a', 'b'),
                Transition::Character('c'),
                Transition::Range('d', 'e'),
            ]
        );
        assert_eq!(
            Transition::disambiguate(
                Transition::Character('a'),
                Transition::Range('a', 'b'),
            ),
            vec![
                Transition::Character('a'),
                Transition::Character('b'),
            ]
        );
        assert_eq!(
            Transition::disambiguate(
                Transition::Character('b'),
                Transition::Range('a', 'b'),
            ),
            vec![
                Transition::Character('a'),
                Transition::Character('b'),
            ]
        );
        assert_eq!(
            Transition::disambiguate(
                Transition::Character('a'),
                Transition::Range('a', 'c'),
            ),
            vec![
                Transition::Character('a'),
                Transition::Range('b', 'c'),
            ]
        );
        assert_eq!(
            Transition::disambiguate(
                Transition::Character('c'),
                Transition::Range('a', 'c'),
            ),
            vec![
                Transition::Range('a', 'b'),
                Transition::Character('c'),
            ]
        )
    }

    #[test]
    fn disambiguate_range_range() {
        assert_eq!(
            Transition::disambiguate(
                Transition::Range('a', 'c'),
                Transition::Range('c', 'e'),
            ),
            vec![
                Transition::Range('a', 'b'),
                Transition::Character('c'),
                Transition::Range('d', 'e'),
            ]
        );
        assert_eq!(
            Transition::disambiguate(
                Transition::Range('a', 'b'),
                Transition::Range('b', 'c'),
            ),
            vec![
                Transition::Character('a'),
                Transition::Character('b'),
                Transition::Character('c'),
            ]
        );
        assert_eq!(
            Transition::disambiguate(
                Transition::Range('a', 'c'),
                Transition::Range('c', 'd'),
            ),
            vec![
                Transition::Range('a', 'b'),
                Transition::Character('c'),
                Transition::Character('d'),
            ]
        );
        assert_eq!(
            Transition::disambiguate(
                Transition::Range('a', 'b'),
                Transition::Range('b', 'd'),
            ),
            vec![
                Transition::Character('a'),
                Transition::Character('b'),
                Transition::Range('c', 'd'),
            ]
        );
        assert_eq!(
            Transition::disambiguate(
                Transition::Range('a', 'e'),
                Transition::Range('b', 'd'),
            ),
            vec![
                Transition::Character('a'),
                Transition::Range('b', 'd'),
                Transition::Character('e'),
            ]
        );
        assert_eq!(
            Transition::disambiguate(
                Transition::Range('a', 'c'),
                Transition::Range('b', 'd'),
            ),
            vec![
                Transition::Character('a'),
                Transition::Range('b', 'c'),
                Transition::Character('d'),
            ]
        )
    }
}

I have cut down fn disambiguate a lot from its first iteration, and all of the tests I added continue to pass. I’m just afraid that I may be missing some corner cases in my tests, and that reading that mess of a match isn’t the easiest task. unreachable!() is also a (weak) smell, since your code could potentially be structured such that you don’t need it.

Transition::Range is an inclusive range and constructing a decreasing or size one range is an error in use.

Oh, and rustfmt.toml:

fn_single_line = true
match_block_trailing_comma = true
wrap_match_arms = false

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP