Case Study: Regular Expressions

The property Ev provides a simple example to illustrate inductive definitions and the basic techniques for reasoning about them, but it's not very exciting - after all, it's equivalent to the two non-inductive definitions of parity we've already seen, and it doesn't seem to offer any concrete benefits over them. To give a better sense of the power of inductive definitions, let's show how to use them to model a classic concept from computer science: regular expressions.

Regular expressions are a simple language for describing strings, defined as follows:


type Regexp (t: Type) {
    emptyset                              
    emptystr                              
    chr   (h: t)                          
    app   (st1: Regexp t) (st2: Regexp t) 
    union (st1: Regexp t) (st2: Regexp t)
    star  (st1: Regexp t)                 
  }

Note that this definition is polymorphic: regular expressions in Reg_exp t describe strings with characters drawn from t - that is, lists of elements of t.

(We deviate slightly from standard practice by not requiring the type t to be finite. This results in a slightly different theory of regular expressions, but the difference is not significant for our purposes.)

We connect regular expressions and strings through the following rules, which define when a regular expression matches a given string:

• The EmptySet expression does not match any string. • The EmptyStr expression matches the empty string []. • The Chr x expression matches the single-character string [x]. • If re1 matches s1 and re2 matches s2, then (App re1 re2) matches (App s1 s2). • If at least one of the expressions re1 and re2 matches s, then Union re1 re2 matches s. • Finally, if we can write a string s as the concatenation of a sequence of strings s = (App s_1 (App... s_k...)), and the expression re matches each of the strings s_i, then Star re matches s.

As a special case, the sequence of strings may be empty, so Star re always matches the empty string [] regardless of what re is.

We can easily translate this informal definition into a data definition as follows:


type Expmatch <t> ~(xs: Data.List t) (r: Regexp t) {
mempty              : Expmatch t []   Regexp.emptystr 
mchar (x: t)        : Expmatch t [x] (Regexp.chr x)
mapp 
  <s1: Data.List t>
  <s2: Data.List t>
  <re1: Regexp t>
  <re2: Regexp t>
  (e1: Expmatch t s1 re1)
  (e2: Expmatch t s2 re2)
  : Expmatch t (Data.List.concat t s1 s2) (Regexp.app re1 re2)
munionl <s1: Data.List t> (re1: Regexp t) (re2: Regexp t)
  (e1: Expmatch t s1 re1)
  : Expmatch t s1 (Regexp.union re1 re2)
munionr (re1: Regexp t) <s2: Data.List t> (re2: Regexp t)
  (e1: Expmatch t s2 re2)
  : Expmatch t s2 (Regexp.union re1 re2)
mstarz (r: Regexp t) : Expmatch t [] (Regexp.star r)  
mstarapp 
  <s1: Data.List t> 
  <s2: Data.List t>
  <r1: Regexp t>
  (e1: Expmatch s1 r1)
  (e2: Expmatch s2 (Regexp.star r1))
  : Expmatch t (Data.List.concat t s1 s2) (Regexp.star r1)
}

Let's illustrate these rules with some examples.


Regexp_ex1 : Expmatch [1] (Regexp.chr 1)
Regexp_ex1 = Expmatch.mchar 1 

Regexp_ex2 : Expmatch [1, 2] (Regexp.app (Regexp.chr 1) (Regexp.chr 2))
Regexp_ex2 = Expmatch.mapp [1] (Regexp.chr 1) (Expmatch.mchar 1) [2] (Regexp.chr 2) (Expmatch.mchar 2)

Using pattern matching, we can also show that certain strings do not match a regular expression:


Regexp_ex3 : Not (Expmatch [1, 2] (Regexp.chr 1))
Regexp_ex3 = Empty.absurd _

We can define auxiliary functions to help write regular expressions. The reg_exp_of_list function constructs a regular expression that matches exactly the list it receives as an argument:


Regexp_of_list <t> (xs: List t)        : Regexp t
Regexp_of_list t List.nil              = Regexp.emptystr
Regexp_of_list t (List.cons xs.h xs.t) = Regexp.app (Regexp.chr xs.h) (Regexp_of_list t xs.t)

Regexp_ex4 : Expmatch [1, 2, 3] (Regexp_of_list [1, 2, 3])
Regexp_ex4 = 
  (Expmatch.mapp 
    [1] 
    (Regexp.chr 1) 
    (Expmatch.mchar 1) 
    [2, 3] 
    (Regexp_of_list [2, 3])
    (Expmatch.mapp
      [2]
      (Regexp.chr 2)
      (Expmatch.mchar 2) 
      [3]
      (Regexp_of_list [3])
      (Expmatch.mapp
        [3]
        (Regexp.chr 3)
        (Expmatch.mchar 3)
        []
        (Regexp_of_list [])
        (Expmatch.mempty)
      )
    )
  )

We can also prove general facts about Exp_match. For example, the following lemma shows that every string s that matches re also matches Star re.


Mstar1 <t> <s: List t> <re: Regexp t> (e: Expmatch s re) : Expmatch s (Regexp.star re)  
Mstar1 t s re e = 
  let msz = Expmatch.mstarz re
  let mss = Expmatch.mstarapp s [] re e msz
  let lst = List_concat s
  let rwt = Equal.rewrite lst (x => (Expmatch t x (Regexp.star re))) mss
  rwt

(Note the use of appendNilRightNeutral*** to change the goal of the theorem to exactly the same form expected by MStarApp.)

Exp_match_ex1

The following lemmas show that the informal matching rules provided at the beginning of the chapter can be derived from the formal inductive definition.


  Munion 
  <t> 
  <s: List t> 
  <re1:  Regexp t> 
  <re2:  Regexp t> 
  (p: Pair (Expmatch s re1) (Expmatch s re2))
  : Expmatch s (Regexp.union re1 re2)
Munion t s re1 re2 (Pair.new fst snd) = 
  Expmatch.munionl t s [] re1 re2

The next lemma is stated in terms of the fold function from the Poly chapter: If ss: List (List t) represents a sequence of strings s1, ..., sn, then fold (++) ss [] is the result of concatenating all of them.


Fold <x> <y> (f: x -> y -> y) (l: Data.List x) (b: y) : y
Fold x y f Data.List.nil b = b
Fold x y f (Data.List.cons xs.h xs.t) b = f xs.h (Fold f xs.t b) 

MStar_ <t>
  (ss : Data.List (Data.List t))
  (re : Regexp t)
  (construct_match : (s : Data.List t) -> (i : In s ss) -> Expmatch s re)
  : Expmatch (Concats ss) (Regexp.star re)
MStar_ ss re construct_match = ?

Reg_exp_of_list

Prove that reg_exp_of_list satisfies the following specification:


Reg_exp_of_list_spec <t> (s1: Data.List t) (s2: Data.List t) : Iff (Expmatch s1 (Regexp_of_list s2)) (Prop.Equal s1 s2)
Reg_exp_of_list_spec s1 s2 = ?

Since the definition of Exp_match has a recursive structure, it is expected that proofs involving regular expressions often require induction based on evidence. For example, suppose we wanted to prove the following intuitive result: if a regular expression re matches some string s, then every element of s must occur somewhere in re. To state this theorem, we first define a function re_chars that lists all characters occurring in a regular expression:


Re_chars <t> (re: Regexp t) : Data.List t
Re_chars t Regexp.emptyset  = []
Re_chars t Regexp.emptystr  = []
Re_chars t (Regexp.chr x)   = [x]
Re_chars t (Regexp.app r0 r1) = Data.List.concat (Re_chars r0) (Re_chars r1)
Re_chars t (Regexp.union r0 r1) = Data.List.concat (Re_chars r0) (Re_chars r1)
Re_chars t (Regexp.star r0) = Re_chars r0

Re_star <t> (re: Regexp t) : Prop.Equal (Re_chars (Regexp.star re)) (Re_chars re)
Re_star e re = Prop.Equal.refl

We can then formulate our theorem as follows:


Destruct <a> (x: a) (lx: Data.List a) (ly: Data.List a) (i: In x (Data.List.concat a lx ly)) : (Data.Either (In x lx) (In x ly))
Destruct a x lx ly i =  
  let Data.Pair.new fst snd = (In_app_iff x lx ly) 
  let f = (fst :: (_) -> (_)) i 
  f

Construct <a> (x: a) (lx: Regexp a) (ly: Regexp a) (e: Data.Either (In x (Re_chars lx)) (In x (Re_chars ly))) : In x (Data.List.concat (Re_chars lx) (Re_chars ly))
Construct a x lx ly e = (Data.Pair.snd ( In_app_iff x (Re_chars lx) (Re_chars ly)) :: (_) -> (_)) e

In_re_match <a> (x: a) (re: Regexp a) (s: Data.List a) (e: Expmatch s re) (i: In x s) : In x (Re_chars re)
In_re_match a x (Regexp.emptyset) Data.List.nil e i = i
In_re_match a x (Regexp.emptyset) (Data.List.cons head tail) e i = Data.Empty.absurd _ //todo
In_re_match a x (Regexp.emptystr) (Data.List.nil) Expmatch.mempty i = i
In_re_match a x (Regexp.chr u s)  (Data.List.cons t re (Data.List.nil _)) (Expmatch.mchar c) i = 
  let e0 = Prop.Equal.refl :: Prop.Equal c re
  let e1 = Prop.Equal.refl :: Prop.Equal s c
  let chn = Prop.Equal.chain e1 e0
  let rrt = Prop.Equal.rewrite (Prop.Equal.mirror chn) (y => (Data.Either (Prop.Equal _ x y) _)) i 
  rrt
In_re_match a x (Regexp.app t z y) (Data.List.concat u l1 l2) e i = 
  let e0 = Prop.Equal.refl :: Prop.Equal u a
  let l3 = l1 :: Data.List a
  let l4 = l2 :: Data.List a
  let e1 = Prop.Equal.refl :: Prop.Equal l1 l3
  let e2 = Prop.Equal.refl :: Prop.Equal l2 l4
  let e3 = Prop.Equal.refl :: Prop.Equal  (Data.List.concat u l1 l2) (Data.List.concat a l3 l4) 
  let i = i :: (In a x (Data.List.concat u l1 l2))
  let rrt = Prop.Equal.rewrite e3 ((k: (Data.List a))=> (In a x k)) i 
  let des = Destruct x l3 l4 rrt
  des
In_re_match a x re Data.List.nil (Expmatch.mstarz z) i = Data.Empty.absurd _ //todo

Re_not_empty

Write a recursive function re_not_empty that tests whether a regular expression matches any string. Prove that your function is correct.


Re_not_empty <t> (re: Regexp t)   : Data.Bool
Re_not_empty t re = ?

Re_not_empty_correct <t> <re: Regexp t> 
: Equivalence (Data.Sigma (Data.List t) (s => Expmatch s re)) (Prop.Equal (Re_not_empty re) Data.Bool.true)
Re_not_empty_correct t re = ?

The remember Tactic

Rewriting the section, dependent pattern matching solves all of this.

A potentially confusing feature of the induction tactic is that it easily allows you to attempt setting up an induction on a term that is not sufficiently general. The effect of this is to lose information (just as destruct can do) and leave you unable to conclude the proof. Here's an example:


Star_app  <a> (re: Regexp a) (s1: Data.List a) (s2: Data.List a) (exp0: Expmatch a s1 (Regexp.star re)) (exp1: Expmatch a s2 (Regexp.star re)) 
: (Expmatch (Data.List.concat s1 s2) (Regexp.star re))
Star_app a re (Data.List.nil _) s2 (Expmatch.mstarz e) exp1 = exp1
Star_app a re s1 s2 (Expmatch.mstarapp t r s r1 m ms) exp1 = 
  let e0 = Prop.Equal.refl :: Prop.Equal r1 re
  let e1 = Prop.Equal.refl :: Prop.Equal (Data.List.concat r s) s1
  let ind = Expmatch.mstarapp m (Star_app r1 s s2 ms exp1)
  let rrt = Prop.Equal.rewrite e0 (x => (Expmatch _ _ (Regexp.star _ x))) ind
  let aux = App_assoc r s s2
  let rrt = Prop.Equal.rewrite aux (x => (Expmatch _ (x) (_))) rrt
  let rrt = Prop.Equal.rewrite e1 (x => (Expmatch t (Data.List.concat _ x s2) (Regexp.star t re))) rrt
  rrt

Exp_match_ex2


MStar2 <t> (re: Regexp t) (s: Data.List t) (ss: Data.List (Data.List t)) (exp: Expmatch s (Regexp.star re)) 
: (Data.Pair (Prop.Equal s (Fold (x => y => Data.List.concat x y ) ss [])) ((s2: Data.List t) -> (In s2 ss) -> (Expmatch s2 re)))

The lemma MStar'' below (along with its inverse, the exercise MStar' above), shows that our definition of Exp_match for Star is equivalent to the one informally given earlier.

pumping

One of the earliest truly interesting theorems in the theory of regular expressions is the so-called pumping lemma, which states, informally, that any sufficiently long string s matching a regular expression re can be "pumped" by repeating some middle section of s an arbitrary number of times to produce a new string also matching re.

To start, we need to define what it means to be "sufficiently long". Since we are working in a constructive logic, we actually need to be able to compute, for each regular expression re, the minimum length for strings s to guarantee "pumpability".



pumping_constant :

Next, it is useful to define an auxiliary function that repeats a string (appends itself) a certain number of times.


napp :

Now, the pumping lemma itself states that if s => re and if the length of s is at leas't the pumping constant of re, then s can be divided into three substrings s1 ++ s2 ++ s3 such that s2 can be repeated any' number of times, and the result, when combined with s1 and s3, will still match re. Since D is also guaranteed to be non-empty, this gives us a (constructive!) way of generating strings matching re that are as long as we desire.



pumping :

To speed up the proof (which you should fill in), the omega tactic, which is enabled by the Require below, is helpful in several places to automatically complete tedious low-level arguments involving equalities or inequalities over natural numbers. We will revisit the omega tactic in a later chapter, but feel free to try it now if you'd like. The first case of the induction gives an example of how it is used.



pumping m le = ?pumping_rhs