[Ur] Regular expressions DSEL?

Ziv Scully zivscully at gmail.com
Thu Mar 9 10:20:40 EST 2017


I agree we probably don't want to do fancy things that are only available
the server. However, I think it's still possible to support nice behavior
for alt just by postprocessing on the Ur/Web side.

Let's say "names" are capture groups from the library user's perspective
and "indices" are capture groups from the FFI's perspective. Right now,
each name corresponds to exactly one index. You could support alt by having
each name correspond to multiple indices, each of which is matched at most
once.

Concrete example: say you have (#X:a)(#Y:b)|(#X:c)(#Z:d)—that is, abusing
notation somewhat, an expression that captures names either X and Y or X
and Z. You could represent this using the record of index lists {X = [1,3],
Y = [2], Z = [4]}. When you get the result list back from the FFI, you have
to check groups 1 and 3 to see if X was matched. If you only ever
concatenate index lists as part of alt, you ensure that you never have two
non-null matches for the same name: given a match, in each name's index
list, all but at most one of the indices is from an alt branch that wasn't
taken.

You could take these ideas further to track statically when a group is
guaranteed to be matched (like X above) or not (like Y or Z above). The
easiest way is probably to have two record parameters for the type—one for
guaranteed match names, another for optional match names. For example
(leaving out folders and disjoitedness constraints):

star : r ::: Unit -> s ::: Unit -> t r s -> t [] (r + s)

plus : r ::: Unit -> s ::: Unit -> t r s -> t r s

capture : r ::: Unit -> s ::: Unit -> nm :: Name -> t r s -> t (r ++ [nm])
s.

The left record has the names of guaranteed matches, and the right record
has the names of optional ones. You can see that plus keeps guaranteed
matches guaranteed, star makes all of them optional. The most complicated
type would, of course, be alt.

alt : r ::: Unit -> r1 ::: Unit -> r2 ::: Unit s1 ::: Unit -> s2 ::: Unit
      -> t (r ++ r1) s1 -> t (r ++ r2) s2
      -> t r (r1 ++ r2 ++ s1 ++ s2)

That is, the only groups that are guaranteed to be matched are those
guaranteed in both of the branches; all others are optional.


On Thu, Mar 9, 2017 at 09:17 Artyom Shalkhakov <artyom.shalkhakov at gmail.com>
wrote:

> 2017-03-09 19:39 GMT+06:00 Ziv Scully <zivscully at gmail.com>:
> > Looks good—glad it worked out!
> >
> > My main question: what happens if the whole string is matched but a
> > particular subgroup isn't? This can happen with alt or star, for
> instance. I
> > couldn't tell from the types. I would expect a return type for match to
> be
> > something like `match r (option string)` instead of `option (match r
> > string)`. I don't know the underlying regex library, so perhaps there's a
> > good reason for this.
>
> I think a test is in order; I don't know what happens in that case (do
> you have an example handy?). From JS RegExp object documentation, it
> seems that subgroups are always present (but, indeed, the strings
> might be nulls or something else entirely, this is JS, after all).
> Good catch!
>
> >
> > I also thought it may be useful if alt allowed the same group to be
> captured
> > on either side, but I realize that takes further fiddling with subgroup
> > indices and post-processing of the results (each group field name would
> map
> > to a list of indices instead of a single index), so it can wait for
> version
> > 2 :).
> >
>
> I was a bit puzzled about the type you gave for [alt], it only dawned
> on me later what you meant with it. :-)
>
> I can see two questions here:
>
> 1. say we have an expression like [a(b)*c], if matched against
> [abbbbbbc] in JS, it will only capture the *first* b -- I guess
> Boost.Regex which is used on the server behaves the same (Benjamin,
> could you help out on this?)
>
> 2. say we have an expression like [(a)|(b)], in JS the groups will be
> 1 and 2. PCRE and some other engines support a "group reset" feature,
> so that we can indeed capture both alternatives with one group. This
> is unsupported in JS, so I decided to leave it as is, for now.
>
> >
> > On Tue, Mar 7, 2017 at 02:11 Artyom Shalkhakov <
> artyom.shalkhakov at gmail.com>
> > wrote:
> >>
> >> Hello Ziv,
> >>
> >> I've implemented your idea [1], and it works. Looking for feedback.
> >>
> >> 2017-02-20 15:48 GMT+06:00 Ziv Scully <zivscully at gmail.com>:
> >> > By parametrizing over a type-level record, you can encode the names of
> >> > matched subgroups in the type of the regular expression. The interface
> >> > probably looks something like:
> >> >
> >> > (* Abstract type of uncompiled regular expressions. *)
> >> > con t :: {Unit} (* names of captured subgroups *) -> Type
> >> >
> >> > val literal :
> >> >     string -> t []
> >> > val any : t []
> >> > val concat : r1 ::: {Unit} -> r2 ::: {Unit} -> [r1 ~ r2] =>
> >> >     t r1 -> t r2 -> t (r1 ++ r2)
> >> > val star : r ::: {Unit} ->
> >> >     t r -> t r
> >> > val alt : r ::: {Unit} -> s1 ::: {Unit} -> s2 ::: {Unit} -> [r ~ s1]
> =>
> >> > [r ~
> >> > s2] => [s1 ~ s2] =>
> >> >     t (r ++ s1) -> t (r ++ s2) -> t (r ++ s1 ++ s2)
> >> > val capture : r ::: {Unit} ->
> >> >     nm :: Name -> [r ~ [nm]] => t r -> t (r ++ [nm])
> >> >
> >> > (* Abstract type of compiled regular expressions *)
> >> > con compiled :: {Unit} -> Type
> >> >
> >> > val compile : r ::: {Unit} ->
> >> >     t r -> compiled r
> >> >
> >> > Obviously this probably won't exactly work. For instance, if the
> >> > underlying
> >> > library requires explicit names for captured groups, you'll need
> >> > something
> >> > more like
> >> >
> >> > val capture : r ::: {Unit} -> nm :: Name -> [r ~ [nm]] =>
> >> >     {nm : string} -> t r -> t (r ++ [nm])
> >> >
> >> > because there's no way to extract a string from a type-level Name. In
> >> > fact,
> >> > this way has another benefit: the internal representation of the type
> >> > can
> >> > just be a string! All the tracking of subgroups can happen at the type
> >> > level. You may not even need compile.
> >> >
> >> > This interface assumes each subgroup may appear 0 or 1 times. You
> could
> >> > maybe keep track of the number of matches more precisely by using
> {Type}
> >> > instead of {Unit} and using a type class to implement type-level
> >> > functions.
> >> >
> >> > con t : {Type} (* maps names of captured subgroups to type-level
> >> > representation of number of matched fields *) -> Type
> >> >
> >> > type one
> >> > type oneOrNone
> >> >
> >> > class meet : Type -> Type -> Type
> >> > val meet_one_oneOrNone : weaken one oneOrNone oneOrNone
> >> > val meet_oneOrNone_one : weaken oneOrNone one oneOrNone
> >> > val meet_oneOrNone_oneOrNone : weaken oneOrNone oneOrNone oneOrNone
> >> >
> >> > This makes the types more complicated, but I suspect (though I don't
> >> > know
> >> > for sure) that Ur/Web will be able to infer the extra stuff in any
> >> > concrete
> >> > (that is, monomorphic) use, so the user code will still look pretty.
> >> > Here's
> >> > how some of the functions look with this. The place weaken gets used
> is
> >> > alt.
> >> >
> >> > val star : r ::: {Type} ->
> >> >     t r -> t (map (fn _ => oneOrNo) r)
> >> >
> >> > val alt : r ::: {Type * Type * Type} -> s1 ::: {Type} -> s2 ::: {Type}
> >> > ->
> >> >     [map fst3 r ~ s1] => [map snd3 r ~ s2] => [s1 ~ s2] =>
> >> >     $(map (fn (t1, t2, t3) => weaken t1 t2 t3) r)
> >> >     t (map fst3 r ++ s1) -> t (map snd3 r ++ s2)
> >> >     -> t (map thd3 r ++ map (fn _ => oneOrNo) (s1 ++ s2))
> >> >
> >> > You could extend the same sort of idea to types like oneOrMore and
> >> > noneOrMore—useful if the underlying library can return a list of
> matches
> >> > inside a + or *—by adding more cases to weaken.
> >> >
> >> > Here's a Haskell library that takes basically this approach:
> >> > https://github.com/sweirich/dth/tree/master/popl17/src
> >> >
> >> > Hope this helps!
> >> > Ziv
> >> >
> >> >
> >> > On Mon, Feb 20, 2017 at 12:00 AM, Artyom Shalkhakov
> >> > <artyom.shalkhakov at gmail.com> wrote:
> >> >>
> >> >> 2017-02-20 10:47 GMT+06:00 <fold at tuta.io>:
> >> >>>
> >> >>> Maybe just define another attribute in basis.urs?
> >> >>>
> >> >>> It would be much interesting to define a DSEL though. I have not
> seen
> >> >>> any
> >> >>> attempts to create a DSEL for regular expressions.
> >> >>>
> >> >>
> >> >> Indeed! And it seems like it would be a nice exercise as well having
> >> >> some
> >> >> practical uses to it.
> >> >>
> >> >> I envision something along the lines of:
> >> >>
> >> >> datatype RegExp = ... (* constructors *)
> >> >>
> >> >> type t (* abstract, not exported outside the module *)
> >> >> val compile : RegExp -> t (* the trusted function *)
> >> >> val matches : string -> t -> bool (* check if passed-in string
> matches
> >> >> the
> >> >> expression *)
> >> >>
> >> >> And the implementation might use something unsafe (like a string, or
> an
> >> >> actual regexp object in JS). Then we could offload the interpretation
> >> >> of
> >> >> regexps to an off-the-shelf library. The [compile] function would
> print
> >> >> and
> >> >> escape the AST to a regexp string, and the Ur/Web compiler might
> >> >> actually
> >> >> optimize all of this away.
> >> >>
> >> >> If we are to take care of matching subgroups, that would require a
> more
> >> >> complicated type. I guess that should be doable as well, but I don't
> >> >> see
> >> >> immediately how to handle this.
> >> >>
> >> >>>
> >> >>>
> >> >>>
> >> >>> 20. Feb 2017 16:28 by artyom.shalkhakov at gmail.com:
> >> >>>
> >> >>>
> >> >>> Hello all,
> >> >>>
> >> >>> 'm doing form validation stuff, following examples on MDN. [1]
> >> >>>
> >> >>> It seems like Ur/Web doesn't handle regular expressions?
> >> >>>
> >> >>> I guess it should not be too diffcult to define a DSEL in Ur/Web
> (just
> >> >>> AST constructors and a few trusted functions: e.g. [compile : AST ->
> >> >>> regexp], where [regexp] is an abstract type, or maybe just a string
> >> >>> :-)) and
> >> >>> have it work at least in the browser.
> >> >>>
> >> >>> Could somebody point me in the right direction here?
> >> >>>
> >> >>> --
> >> >>> Cheers,
> >> >>> Artyom Shalkhakov
> >> >>>
> >> >>> [1]
> >> >>>
> >> >>>
> https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/HTML5/Constraint_validation
> >> >>>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Cheers,
> >> >> Artyom Shalkhakov
> >> >>
> >> >> _______________________________________________
> >> >> Ur mailing list
> >> >> Ur at impredicative.com
> >> >> http://www.impredicative.com/cgi-bin/mailman/listinfo/ur
> >> >>
> >> >
> >> >
> >> > _______________________________________________
> >> > Ur mailing list
> >> > Ur at impredicative.com
> >> > http://www.impredicative.com/cgi-bin/mailman/listinfo/ur
> >> >
> >>
> >> --
> >> Cheers,
> >> Artyom Shalkhakov
> >>
> >> [1] https://github.com/bbarenblat/urweb-regex/pull/1
> >>
> >> _______________________________________________
> >> Ur mailing list
> >> Ur at impredicative.com
> >> http://www.impredicative.com/cgi-bin/mailman/listinfo/ur
> >
> >
> > _______________________________________________
> > Ur mailing list
> > Ur at impredicative.com
> > http://www.impredicative.com/cgi-bin/mailman/listinfo/ur
> >
>
>
>
> --
> Cheers,
> Artyom Shalkhakov
>
> _______________________________________________
> Ur mailing list
> Ur at impredicative.com
> http://www.impredicative.com/cgi-bin/mailman/listinfo/ur
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.impredicative.com/pipermail/ur/attachments/20170309/ddb58573/attachment-0001.html>


More information about the Ur mailing list