We need to finish out the Regex API (and decide if/how we might deal with JVM, JS, and .NET incompatibilities).
brianMon 8 Jun 2009
Promoted to ticket #630 and assigned to brian
JohnDGMon 8 Jun 2009
It needs to be platform independent. Otherwise what's the point of Fan being cross-platform compatible???
Cross-platform compatibility implies one of two approaches:
Bundling a regexp engine in Fan, which can be written in Fan (but based on an existing regular expression engine).
Choosing a format, and then compiling regular expressions to native. This means writing a parser for the chosen regular expression format, and targeting different engines at runtime or compile time (the RegExp DSL could easily do it at compile time along with pattern checking).
brianMon 8 Jun 2009
I guess I'd have see how different they really are first. Are we talking obscure edge cases or substantial differences? Does anyone know?
tompalmerMon 8 Jun 2009
I agree with JohnDG that the default should be full compatibility (including the same "you can't do thats") across platforms. I think his two options are both worth considering. A third is one that just prohibits features that aren't consistent across all platforms.
I'm sure there are differences between the platforms (vague memories from past experience), but I have not done a full nor recent study of the issues.
Doing a regex engine in Fan (option 1) is my favorite idea at the moment. Maybe there is some nicely licensed, fully featured, easy reading engine out there to base things on. Don't know.
KevinKelleyMon 8 Jun 2009
I think it would be worthwhile to do a regex implementation in Fan; I was scouting around the other day for regex links since I want them for the grammar generator I'm working on (Earley parser in Fan, with user interface).
A few links: Here's an article about implementing Thompson's NFA construction to avoid a nasty edge case that apparently exists in many common regex implementations; there's code in C but I'm not sure it's a production-level implementation; the article refers to a page of links to RE implementations, including Plan 9 grep;
TRE seems like a pretty complete and portable (C I think) implementation of POSIX regex, BSD licensed.
Doing it right seems to be non-trivial, so for the moment I'm tabling the re-implement option, in favor of using Fan's Java/C# wrap until it turns out not to be good enough.
tompalmerMon 8 Jun 2009
Doing it right seems to be non-trivial, so for the moment I'm tabling the re-implement option, in favor of using Fan's Java/C# wrapuntil it turns out not to be good enough.
The problem is the backwards-compatibility issue. People would be writing non-portable code meanwhile, and fixing that later may break their code for future versions of Fan.
brianMon 8 Jun 2009
I think the issue with a pure Fan implementation is that it will likely suck wind in JavaScript versus using the JavaScript VM's native regex engine.
So I'd more inclined to:
figure how Fan API translates into native regex
only support features portable across JVM, CLR, and JS
I am not personally an expert on regex, so this would be a great project for some one else to step up and take ownership.
tompalmerMon 8 Jun 2009
likely suck wind in JavaScript versus using the JavaScript VM's native regex engine
Forgot about that issue. Good point.
KevinKelleyMon 8 Jun 2009
Digging into some docs...
Java 1.6 doc claims level 1 compliance with UnicodeTS#18, and looks to me pretty extensive and complete.
C# (Windows SDK .NET framework 3.5) Regex looks very similar, but I can't find a standards compliance statement.
Don't know about JS.
Both the Java and C# go far beyond what I would do, if I had to re-implement.
Both use a similar scheme: a Regex class, initialized with a string regex, compiles the regex into a state machine. A Matcher class is then used to match the regex against an input string.
Fan follows that scheme, with a minimal Regex and Matcher that wrap the underlying implementations.
only support features portable across JVM, CLR, and JS
That's the hard part; the meat of regex's is the language they understand. Without parsing the regex initialization string, I don't see how to support or not support anything.
On the positive side: regex's have been around for 50 years or so; both Java and C# are fairly mature; at least until you get very very deep in esoterica, it seems likely that there won't be a problem.
My suggestion is just to say, Fan supports regular expressions as implemented by the runtime you're using.
Collecting a bunch of test cases, representing as many as possible of the various features of regex's, seems like a good idea.
brianMon 8 Jun 2009
Collecting a bunch of test cases, representing as many as possible of the various features of regex's, seems like a good idea.
I agree, I think to proceed:
define what features we need above and beyond what is there?
write the tests and prototype in Java and C# and JS
see what problems we run into
Like I said, I would really like someone else to take on this project - any volunteers?
tool69Mon 8 Jun 2009
I agree with Kevin,
ie just using them inside the runtime you're in. That's simpler, and if we have a new runtime, we'll have less problems to solve.
I've bookmarked these blog posts about regex implementation (in Factor), they're interesting.
Also, has anyone started something like a Parsing Expression Grammar Lib in Fan ?
tompalmerMon 8 Jun 2009
Another option is find compliant implementations of ECMAScript regexes in Java and C#. It's a cheating technique that might do the job for now.
tcolarMon 8 Jun 2009
I don't think you want to write your own ... this is pretty tricky (and not very fun) stuff.
Javascript and Java have similar impl. of regexp i believe, and it seem .net is close as well.
Couldn't you just write a Fan "decorator" for those implementations ?
I have a few java projects that make heavy use of regexp (one of them is a wiki, kinda like sidewalk), the parser part was made to be callable standalone to do templating stuff, that i have some test for. So whatever implementation you end-up with i could probably run that using the fan parser as some test case.
JohnDGMon 8 Jun 2009
That would mean you can take a cross-platform pod, switch from one platform to another, and suddenly your code fails due to unknown reasons.
Regular expressions need to be cross-platform. I'd second tompalmer's suggestion: find a Java/C# implementations of JavaScript's RegExp class (Rhino for Java, not sure about .NET). Thus JavaScript's regular expression grammar will become the defining grammar for Fan.
Couldn't you just write a Fan "decorator" for those implementations ?
This is indeed possible. You just need to parse the regexp and convert it to the native format (very little would need be done, it's mainly the parsing). It's the "best" solution but not one finding favor here.
KevinKelleyMon 8 Jun 2009
Also, has anyone started something like a Parsing Expression Grammar Lib in Fan ?
Not a PEG but I'm about to release a CFG parser library, in Fan, with a grammar-editor gui (also in Fan).
PEGs sound interesting -- might look at that down the road.
jodastephenMon 8 Jun 2009
I think that remaining as is, wrapping the Java and C# APIs will be perfectly fine for the vast majority of use cases. Has anyone come up with a real use-case where there is a difference that would cause problems?
It would be easy enough to add a new pod independent of the core at a later date for a Regex that is written in Fan, but for now this seems to be chasing after an edge case.
tompalmerTue 9 Jun 2009
I did a super quick review. It seems like Java and .NET have named character classes (like "\p{Lu}") but not always with the same names, and I don't see this feature in ECMAScript.
Also, it seems .NET has some features like comments and named captures that I don't see in Java or ECMAScript.
Could be I'm mistaken about the above, however, and I might have missed other differences.
brianTue 9 Jun 2009
Any volunteers to take ownership of this feature?
tompalmerTue 9 Jun 2009
John and I are the ones who seem to care, so it might have to be one of us. But I can't commit the time for anything major. I did just put up a blog post with a call for help on the subject.
My current thought is that if ECMAScript 3 regexes are a proper subset of both Java and .NET, then we'll be able to say, "stick to ECMAScript 3 standardized features, and you'll be safe". If you are willing to commit to that for future platforms (if any). Alternatively, the slightly looser guarantee is, "uses the underlying platform, but as of today that means you are safe if you stick to ECMAScript 3 features".
qualidafialTue 9 Jun 2009
My feeling on this is that we should just use the underlying regex support of the platform. However the current state of documentation in sys::Regex is virtually nonexistent wrt what patterns are supported. The fandoc should document each pattern that is known to be supported on all platforms. That way if anybody uses an undocumented pattern (say .NET's named captures), then they are on their own with regard to portability.
What I wonder is whether there is an unencumbered regex specification which we can use in Fan's regex docs.
tompalmerTue 9 Jun 2009
What I wonder is whether there is an unencumbered regex specification which we can use in Fan's regex docs.
I haven't checked, but my first guess is that we won't be able to steal the text of the ECMAScript spec.
KevinKelleyTue 9 Jun 2009
I think a good place to start is the Java docs in the summary of regular expression constructs section.
I'd like to see a similar summary in the Fandoc, and it would be nice to have a set of test cases for everything in that summary; first so as to verify that they actually work on the Java platform, then to test whether they also work on C# and JS.
qualidafialTue 9 Jun 2009
There's a regex cheatsheat here that embeds lots of (presumably javascript) assert expressions that seem easy to port. It might be possible to get permission from them to use these asserts as a starting point for tests in Fan.
KevinKelleyWed 10 Jun 2009
Any volunteers to take ownership of this feature?
I won't admit to ownership, but since I'm in the middle of Fan regexes anyway I've started to add some tests to testSys/RegexTest.fan to exercise the regex features list from the javadocs.
When I get more or less through the list (or get sidetracked) I'll pass it on, so they can be added in; then fant can let us know if a runtime doesn't support a feature.
There's actually a pretty long list of features in the Java and the C# regexes (not sure about JS/Ecmascript).
brianWed 10 Jun 2009
tests would be great!
KevinKelleyMon 15 Jun 2009
Okay, I've about had my fill of regexes now. :-) I've added tests to RegexTest.fan for nearly all the features listed in in the java language documentation, as discussed above.
Number of verifies has gone up from 102 to 962. All pass, when running on a java 1.6 VM.
These are mainly existence tests, not fully exercising the functionality but verifying that each feature exists, is recognized, and works for at least a simple test case.
There are new tests for
special characters (escapes etc, but didn't do control codes)
character classes
Posix character classes
Java character classes (I'd guess these will fail on non-java VM)
Unicode blocks (very briefly)
boundary matchers
Quantifiers: greedy, reluctant, and possessive
logical operators (duh)
back references
quotation
There's a section for Special Constructs that I don't understand and didn't do.
diff -r f172726f9e28 src/testSys/fan/RegexTest.fan
--- a/src/testSys/fan/RegexTest.fan Mon May 18 16:30:50 2009 -0400
+++ b/src/testSys/fan/RegexTest.fan Mon Jun 15 11:52:22 2009 -0500
@@ -134,5 +134,403 @@
}
}
+//////////////////////////////////////////////////////////////////////////
+// Special characters, control chars
+//////////////////////////////////////////////////////////////////////////
+
+ Void testSpecialChars()
+ {
+ // build a str where the character index is the character value.
+ str := (0..255).toList.map(Str[,])|Int ch->Str|{ch.toChar}.join
+
+ // verify backslash escapes - tab, nl, etc.
+ m := Regex<|[\t\n\r\f\a\e]|>.matcher(str)
+ verifyMultipleMatches(m,
+ [ ["\u0007", 7, 8], ["\u0009", 9,10], ["\u000a",10,11],
+ ["\u000c",12,13], ["\u000d",13,14], ["\u001b",27,28] ])
+
+ // hex space & tab; octal 'A' & 'B'; unicode 'a' & 'b'
+ m = Regex<|[\x20\x09\0101\0102\u0061\u0062]|>.matcher(str)
+ verifyMultipleMatches(m,
+ [ ["\t",9,10], [" ",32,33],
+ ["A",65,66], ["B",66,67],
+ ["a",97,98], ["b",98,99] ])
+
+ // ought to do the control characters: \cC, for ^C etc, but I don't
+ // know the list of codes/values offhand
+
+ // digit and nondigit
+ m = Regex<|\d\D|>.matcher(str)
+ verifyMultipleMatches(m, [ ["9:",'9','9'+2] ])
+
+ // space and nonspace. Space is: [\t\n\x0b\f\r ] == 0x09-0x0d + 0x20
+ m = Regex<|(\s\s\s\s)?\s\S|>.matcher(str)
+ verifyMultipleMatches(m,
+ [ ["\u0009\u000a\u000b\u000c\u000d\u000e",9,15], ["\u0020\u0021",32,34] ])
+
+ // word and nonword. Word is: [a-zA-Z0-9_]
+ m = Regex<|\w\W|>.matcher(str)
+ verifyMultipleMatches(m,
+ [ ["9:",57,59], ["Z[",90,92],
+ ["_`",95,97], ["z{",122,124] ])
+
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Character classes
+//////////////////////////////////////////////////////////////////////////
+
+ Void testClasses()
+ { // 1111111111222222222233333333334444
+ //01234567890123456789012345678901234567890123
+ qbf := "the Quick Brown fox jumped Over the Lazy dog"
+
+ // union and range
+ m := Regex<|[abc[A-Z]]|>.matcher(qbf)
+ verifyMultipleMatches(m, [ ["Q",4,5], ["c",7,8], ["B",10,11], ["O",27,28], ["L",36,37], ["a",37,38] ])
+
+ // negative class
+ m = Regex<|[^a-z ]|>.matcher(qbf)
+ verifyMultipleMatches(m, [ ["Q",4,5], ["B",10,11], ["O",27,28], ["L",36,37] ])
+
+ // intersection
+ m = Regex<|[a-z&&[fg]]|>.matcher(qbf)
+ verifyMultipleMatches(m, [ ["f",16,17], ["g",43,44] ])
+
+ // subtraction
+ m = Regex<|[a-z&&[^c-x]]|>.matcher(qbf)
+ verifyMultipleMatches(m, [ ["a",37,38], ["z",38,39], ["y",39,40] ])
+ }
+
+ // cycle the matcher and verify that it finds all (and only)
+ // the expected matches.
+ Void verifyMultipleMatches(RegexMatcher m, Obj[][] expected)
+ {
+ expected.each |Obj[] x|
+ {
+ verify(m.find)
+ verifyEq(m.group, x[0])
+ verifyEq(m.start, x[1])
+ verifyEq(m.end, x[2])
+ }
+ verify(!m.find)
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Posix character classes
+//////////////////////////////////////////////////////////////////////////
+
+ Void testPosixClasses()
+ {
+ // \p{xxx} or \P{xxx} where xxx is one of:
+ // Lower Upper ASCII Alpha Digit Alnum
+ // Punct Graph Print Blank Cntrl XDigit Space
+ // \p means in the class, \P means not in the class
+ // \p{Lower} A lower-case alphabetic character: [a-z]
+ // \p{Upper} An upper-case alphabetic character:[A-Z]
+ // \p{ASCII} All ASCII:[\x00-\x7F]
+ // \p{Alpha} An alphabetic character:[\p{Lower}\p{Upper}]
+ // \p{Digit} A decimal digit: [0-9]
+ // \p{Alnum} An alphanumeric character:[\p{Alpha}\p{Digit}]
+ // \p{Punct} Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
+ // \p{Graph} A visible character: [\p{Alnum}\p{Punct}]
+ // \p{Print} A printable character: [\p{Graph}\x20]
+ // \p{Blank} A space or a tab: [ \t]
+ // \p{Cntrl} A control character: [\x00-\x1F\x7F]
+ // \p{XDigit} A hexadecimal digit: [0-9a-fA-F]
+ // \p{Space} A whitespace character: [ \t\n\x0B\f\r]
+
+ verifyMatches(Regex<|\p{Lower}|>, "a", true)
+ verifyMatches(Regex<|\P{Lower}|>, "A", true)
+
+ verifyMatches(Regex<|\p{Upper}|>, "A", true)
+ verifyMatches(Regex<|\P{Upper}|>, "a", true)
+
+ verifyMatches(Regex<|\p{ASCII}|>, "a", true)
+ verifyMatches(Regex<|\P{ASCII}|>, "A", false)
+
+ verifyMatches(Regex<|\p{Alpha}|>, "a", true)
+ verifyMatches(Regex<|\P{Alpha}|>, "_", true)
+
+ verifyMatches(Regex<|\p{Digit}|>, "1", true)
+ verifyMatches(Regex<|\P{Digit}|>, "A", true)
+
+ verifyMatches(Regex<|\p{Alnum}|>, "a", true)
+ verifyMatches(Regex<|\P{Alnum}|>, "_", true)
+
+ verifyMatches(Regex<|\p{Punct}|>, ":", true)
+ verifyMatches(Regex<|\P{Punct}|>, "A", true)
+
+ verifyMatches(Regex<|\p{Graph}|>, "a", true)
+ verifyMatches(Regex<|\P{Graph}|>, "\t", true)
+
+ verifyMatches(Regex<|\p{Print}|>, " ", true)
+ verifyMatches(Regex<|\P{Print}|>, "\t", true)
+
+ verifyMatches(Regex<|\p{Blank}|>, " ", true)
+ verifyMatches(Regex<|\P{Blank}|>, "A", true)
+
+ verifyMatches(Regex<|\p{Cntrl}|>, "\t", true)
+ verifyMatches(Regex<|\P{Cntrl}|>, "A", true)
+
+ verifyMatches(Regex<|\p{XDigit}|>, "a", true)
+ verifyMatches(Regex<|\P{XDigit}|>, "g", true)
+
+ verifyMatches(Regex<|\p{Space}|>, "\t", true)
+ verifyMatches(Regex<|\P{Space}|>, "A", true)
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// java.lang.Character classes
+//////////////////////////////////////////////////////////////////////////
+
+ Void testJavaCharacterClasses()
+ {
+ // \p{xxx} or \P{xxx} where xxx is one of:
+ // \p{javaLowerCase} Equivalent to java.lang.Character.isLowerCase()
+ // \p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
+ // \p{javaWhitespace} Equivalent to java.lang.Character.isWhitespace()
+ // \p{javaMirrored} Equivalent to java.lang.Character.isMirrored()
+
+ verifyMatches(Regex<|\p{javaLowerCase}|>, "a", true)
+ verifyMatches(Regex<|\P{javaLowerCase}|>, "A", true)
+
+ verifyMatches(Regex<|\p{javaUpperCase}|>, "A", true)
+ verifyMatches(Regex<|\P{javaUpperCase}|>, "a", true)
+
+ verifyMatches(Regex<|\p{javaWhitespace}|>, " ", true)
+ verifyMatches(Regex<|\P{javaWhitespace}|>, "a", true)
+
+ // not even sure what 'mirrored' means
+ verifyMatches(Regex<|\p{javaMirrored}|>, "a", false)
+ verifyMatches(Regex<|\P{javaMirrored}|>, "a", true)
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Unicode blocks and categories
+//////////////////////////////////////////////////////////////////////////
+
+ Void testUnicodeBlocks()
+ {
+ // \p{xxx} or \P{xxx} where xxx is one of:
+ // InGreek Lu Sc (L is letter, Lu is uppercase, S is Symbol, Sc is currency)
+ // @see java.lang.Character.UnicodeBlock for supported blocks
+
+ verifyMatches(Regex<|\p{L}|>, "a", true)
+ verifyMatches(Regex<|\p{Lu}|>, "A", true)
+ verifyMatches(Regex<|\p{S}|>, ":", false)
+ verifyMatches(Regex<|\p{Sc}|>, "\$", true)
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Boundary matchers - match to start or end of text / line
+//////////////////////////////////////////////////////////////////////////
+
+ Void testBoundaryMatchers()
+ {
+ // ^ $ \b \B \A \G \Z \z
+ // ^ The beginning of a line
+ // $ The end of a line
+ // \b A word boundary
+ // \B A non-word boundary
+ // \A The beginning of the input
+ // \G The end of the previous match
+ // \Z The end of the input but for the final terminator, if any
+ // \z The end of the input
+
+ m := Regex<|^a.*d$|>.matcher("abcd")
+ verifyGroups(m, [ ["abcd", 0, 4] ])
+ verify(!m.find)
+
+ m = Regex<|\b..\b|>.matcher("ab cd")
+ verifyGroups(m, [ ["ab", 0, 2] ])
+ verifyGroups(m, [ ["cd", 3, 5] ])
+ verify(!m.find)
+
+ m = Regex<|.\B.|>.matcher("ab:cd")
+ verifyGroups(m, [ ["ab", 0, 2] ])
+ verifyGroups(m, [ ["cd", 3, 5] ])
+ verify(!m.find)
+
+ // need \G test
+
+ m = Regex<|\A.*\Z|>.matcher("abcd")
+ verifyGroups(m, [ ["abcd", 0, 4] ])
+ verify(!m.find)
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Greedy
+//////////////////////////////////////////////////////////////////////////
+
+ Void testGreedyQuantifiers()
+ {
+ // X? X, once or not at all
+ // X* X, zero or more times
+ // X+ X, one or more times
+ // X{n} X, exactly n times
+ // X{n,} X, at least n times
+ // X{n,m} X, at least n but not more than m times
+
+ m := Regex<|(a*)(a+)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aaa", 0, 3], ["a", 3, 4] ])
+
+ m = Regex<|(a*)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aaaa", 0, 4], ["", 4, 4] ])
+
+ m = Regex<|(a?)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["a", 0, 1], ["aaa", 1, 4] ])
+
+ m = Regex<|(a+)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aaaa", 0, 4], ["", 4, 4] ])
+
+ m = Regex<|(a{2})(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aa", 0, 2], ["aa", 2, 4] ])
+
+ m = Regex<|(a{2,})(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aaaa", 0, 4], ["", 4, 4] ])
+
+ m = Regex<|(a{2,3})(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aaa", 0, 3], ["a", 3, 4] ])
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Reluctant
+//////////////////////////////////////////////////////////////////////////
+
+ Void testReluctantQuantifiers()
+ {
+ // X?? X, once or not at all
+ // X*? X, zero or more times
+ // X+? X, one or more times
+ // X{n}? X, exactly n times
+ // X{n,}? X, at least n times
+ // X{n,m}? X, at least n but not more than m times
+
+ m := Regex<|(a*?)(a+)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["", 0, 0], ["aaaa", 0, 4] ])
+
+ m = Regex<|(a*?)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["", 0, 0], ["aaaa", 0, 4] ])
+
+ m = Regex<|(a??)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["", 0, 0], ["aaaa", 0, 4] ])
+
+ m = Regex<|(a+?)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["a", 0, 1], ["aaa", 1, 4] ])
+
+ m = Regex<|(a{2}?)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aa", 0, 2], ["aa", 2, 4] ])
+
+ m = Regex<|(a{2,}?)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aa", 0, 2], ["aa", 2, 4] ])
+
+ m = Regex<|(a{2,3}?)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aa", 0, 2], ["aa", 2, 4] ])
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Posessive
+//////////////////////////////////////////////////////////////////////////
+
+ Void testPosessiveQuantifiers()
+ {
+ // X?+ X, once or not at all
+ // X*+ X, zero or more times
+ // X++ X, one or more times
+ // X{n}+ X, exactly n times
+ // X{n,}+ X, at least n times
+ // X{n,m}+ X, at least n but not more than m times
+
+ m := Regex<|(a*+)a|>.matcher("aaaa")
+ //verifyGroups(m, [ ["aaaa", 0, 4], ["aaaa", 0, 4] ], [null, -1, -1] ])
+ verify(!m.find) // possessive group steals all the a's
+
+ m = Regex<|(a*+)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aaaa", 0, 4], ["", 4, 4] ])
+
+ m = Regex<|(a?+)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["a", 0, 1], ["aaa", 1, 4] ])
+
+ m = Regex<|(a++)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aaaa", 0, 4], ["", 4, 4] ])
+
+ m = Regex<|(a{2}+)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aa", 0, 2], ["aa", 2, 4] ])
+
+ m = Regex<|(a{2,}+)a|>.matcher("aaaa")
+ verify(!m.find)
+
+ m = Regex<|(a{2,3}+)(a*)|>.matcher("aaaa")
+ verifyGroups(m, [ ["aaaa", 0, 4], ["aaa", 0, 3], ["a", 3, 4] ])
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Logical operators
+//////////////////////////////////////////////////////////////////////////
+
+ Void testLogicalOperators()
+ {
+ // XY X followed by Y
+ // X|Y Either X or Y
+ // (X) X, as a capturing group
+
+ verifyMatches(Regex.fromStr("XY"), "XY", true)
+ verifyMatches(Regex.fromStr("X|Y"), "X", true)
+ verifyMatches(Regex.fromStr("X|Y"), "Y", true)
+ m := Regex<|(X)|>.matcher("X")
+ verifyGroups(m, [ ["X", 0, 1], ["X", 0, 1] ])
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Back references
+//////////////////////////////////////////////////////////////////////////
+
+ Void testBackReferences()
+ {
+ // \n Whatever the nth capturing group matched
+ m := Regex<|(a(b)c)d\2\1|>.matcher("abcdbabc")
+ verifyGroups(m, [ ["abcdbabc", 0, 8], ["abc", 0, 3], ["b", 1, 2] ])
+ verify(!m.find)
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Quotation
+//////////////////////////////////////////////////////////////////////////
+
+ Void testQuotation()
+ {
+ // \ Nothing, but quotes the following character
+ // \Q Nothing, but quotes all characters until \E
+ // \E Nothing, but ends quoting started by \Q
+
+ // assume normal backslash quote is tested elsewhere;
+ // verify we understand \Q..\E
+ // NOTE: this does regex-quoting, quotes characters so they don't
+ // have special meaning to Regex. Doesn't do string quoting, so
+ // \Qt\E doesn't make a tab.
+ m := Regex<|\Q[.]trn\E|>.matcher("[.]trn")
+ verifyGroups(m, [ ["[.]trn", 0, 6] ])
+ verify(!m.find)
+
+ }
+
+//////////////////////////////////////////////////////////////////////////
+// Special constructs
+//////////////////////////////////////////////////////////////////////////
+
+ Void testSpecialConstructs()
+ {
+ // (?:X) X, as a non-capturing group
+ // (?idmsux-idmsux) Nothing, but turns match flags i d m s u x on - off
+ // (?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
+ // (?=X) X, via zero-width positive lookahead
+ // (?!X) X, via zero-width negative lookahead
+ // (?<=X) X, via zero-width positive lookbehind
+ // (?<!X) X, via zero-width negative lookbehind
+ // (?>X) X, as an independent, non-capturing group
+
+ }
+
}
brianWed 17 Jun 2009
Kevin, thank you very much for your patch - I pushed into hg.
At first glance it would appear we have a lot of problems with portability since many of the tests you added fail in .NET:
-- Run: testSys::RegexTest.testIdentity...
Pass: testSys::RegexTest.testIdentity [5]
-- Run: testSys::RegexTest.testGlob...
Pass: testSys::RegexTest.testGlob [14]
-- Run: testSys::RegexTest.testSplit...
Pass: testSys::RegexTest.testSplit [12]
-- Run: testSys::RegexTest.testMatches...
Pass: testSys::RegexTest.testMatches [30]
-- Run: testSys::RegexTest.testGroups...
TEST FAILED
sys::TestErr: Test failed
sys::Test.fail (Test.cs:178)
sys::Test.verify (Test.cs:64)
sys::Test.verify (Test.cs:61)
testSys::RegexTest.testGroups (RegexTest.fan:134)
-- Run: testSys::RegexTest.testSpecialChars...
TEST FAILED
sys::Err: Exception of type 'System.Exception' was thrown.
sys::RegexMatcher.group (RegexMatcher.cs:60)
sys::RegexMatcher.group (RegexMatcher.cs:56)
testSys::RegexTest.verifyMultipleMatches (RegexTest.fan:240)
sys::List.each (List.cs:505)
testSys::RegexTest.verifyMultipleMatches (RegexTest.fan:237)
testSys::RegexTest.testSpecialChars (RegexTest.fan:178)
-- Run: testSys::RegexTest.testClasses...
TEST FAILED
sys::Err: Exception of type 'System.Exception' was thrown.
sys::RegexMatcher.group (RegexMatcher.cs:60)
sys::RegexMatcher.group (RegexMatcher.cs:56)
testSys::RegexTest.verifyMultipleMatches (RegexTest.fan:240)
sys::List.each (List.cs:505)
testSys::RegexTest.verifyMultipleMatches (RegexTest.fan:237)
testSys::RegexTest.testClasses (RegexTest.fan:218)
-- Run: testSys::RegexTest.testPosixClasses...
TEST FAILED
sys::ArgErr: parsing "\p{Lower}" - Unknown property 'Lower'.
at System.Text.RegularExpressions.RegexCharClass.SetFromProperty (Unknown Source)
at System.Text.RegularExpressions.RegexCharClass.AddCategoryFromName (Unknown Source)
at System.Text.RegularExpressions.RegexParser.ScanBackslash (Unknown Source)
at System.Text.RegularExpressions.RegexParser.ScanRegex (Unknown Source)
at System.Text.RegularExpressions.RegexParser.Parse (Unknown Source)
at System.Text.RegularExpressions.Regex..ctor (Unknown Source)
at System.Text.RegularExpressions.Regex..ctor (Unknown Source)
sys::Regex..ctor (Regex.cs:47)
sys::Regex.fromStr (Regex.cs:27)
testSys::RegexTest.testPosixClasses (RegexTest.fan:271)
-- Run: testSys::RegexTest.testJavaCharacterClasses...
TEST FAILED
sys::ArgErr: parsing "\p{javaLowerCase}" - Unknown property 'javaLowerCase'.
at System.Text.RegularExpressions.RegexCharClass.SetFromProperty (Unknown Source)
at System.Text.RegularExpressions.RegexCharClass.AddCategoryFromName (Unknown Source)
at System.Text.RegularExpressions.RegexParser.ScanBackslash (Unknown Source)
at System.Text.RegularExpressions.RegexParser.ScanRegex (Unknown Source)
at System.Text.RegularExpressions.RegexParser.Parse (Unknown Source)
at System.Text.RegularExpressions.Regex..ctor (Unknown Source)
at System.Text.RegularExpressions.Regex..ctor (Unknown Source)
sys::Regex..ctor (Regex.cs:47)
sys::Regex.fromStr (Regex.cs:27)
testSys::RegexTest.testJavaCharacterClasses (RegexTest.fan:323)
-- Run: testSys::RegexTest.testUnicodeBlocks...
Pass: testSys::RegexTest.testUnicodeBlocks [33]
-- Run: testSys::RegexTest.testBoundaryMatchers...
TEST FAILED
sys::TestErr: Test failed
sys::Test.fail (Test.cs:178)
sys::Test.verify (Test.cs:64)
sys::Test.verify (Test.cs:61)
testSys::RegexTest.testBoundaryMatchers (RegexTest.fan:371)
-- Run: testSys::RegexTest.testGreedyQuantifiers...
Pass: testSys::RegexTest.testGreedyQuantifiers [98]
-- Run: testSys::RegexTest.testReluctantQuantifiers...
Pass: testSys::RegexTest.testReluctantQuantifiers [98]
-- Run: testSys::RegexTest.testPosessiveQuantifiers...
TEST FAILED
sys::ArgErr: parsing "(a*+)a" - Nested quantifier +.
at System.Text.RegularExpressions.RegexParser.ScanRegex (Unknown Source)
at System.Text.RegularExpressions.RegexParser.Parse (Unknown Source)
at System.Text.RegularExpressions.Regex..ctor (Unknown Source)
at System.Text.RegularExpressions.Regex..ctor (Unknown Source)
sys::Regex..ctor (Regex.cs:47)
sys::Regex.fromStr (Regex.cs:27)
testSys::RegexTest.testPosessiveQuantifiers (RegexTest.fan:473)
-- Run: testSys::RegexTest.testLogicalOperators...
Pass: testSys::RegexTest.testLogicalOperators [38]
-- Run: testSys::RegexTest.testBackReferences...
TEST FAILED
sys::TestErr: Test failed
sys::Test.fail (Test.cs:178)
sys::Test.verify (Test.cs:64)
sys::Test.verify (Test.cs:61)
testSys::RegexTest.testBackReferences (RegexTest.fan:522)
-- Run: testSys::RegexTest.testQuotation...
TEST FAILED
sys::ArgErr: parsing "\Q[.]trn\E" - Unrecognized escape sequence \Q.
at System.Text.RegularExpressions.RegexParser.ScanCharEscape (Unknown Source)
at System.Text.RegularExpressions.RegexParser.ScanBasicBackslash (Unknown Source)
at System.Text.RegularExpressions.RegexParser.ScanBackslash (Unknown Source)
at System.Text.RegularExpressions.RegexParser.ScanRegex (Unknown Source)
at System.Text.RegularExpressions.RegexParser.Parse (Unknown Source)
at System.Text.RegularExpressions.Regex..ctor (Unknown Source)
at System.Text.RegularExpressions.Regex..ctor (Unknown Source)
sys::Regex..ctor (Regex.cs:47)
sys::Regex.fromStr (Regex.cs:27)
testSys::RegexTest.testQuotation (RegexTest.fan:540)
-- Run: testSys::RegexTest.testSpecialConstructs...
Pass: testSys::RegexTest.testSpecialConstructs [0]
Time: 15719ms
-- failed: testSys::RegexTest.testGroups
-- failed: testSys::RegexTest.testSpecialChars
-- failed: testSys::RegexTest.testClasses
-- failed: testSys::RegexTest.testPosixClasses
-- failed: testSys::RegexTest.testJavaCharacterClasses
-- failed: testSys::RegexTest.testBoundaryMatchers
-- failed: testSys::RegexTest.testPosessiveQuantifiers
-- failed: testSys::RegexTest.testBackReferences
-- failed: testSys::RegexTest.testQuotation
***
*** 9 FAILURES [1 tests, 9 methods, 328 verifies]
***
KevinKelleyWed 17 Jun 2009
Some of those fails surprise me -- the Groups, SpecialChars, Classes and maybe PosixClasses I'd have guessed would be okay, just from skimming the Windows docs.
It's hard to see from this where the problem is; it looks like the line numbers in the stack traces are pointing to the end of the method that failed, instead of to the failing line in the method.
I guess the next thing is to sit down with both runtimes and narrow down (and document!) what the incompatibilities are.
I'll try to spend some time with it, maybe over the weekend, see if we can get a clearer picture of what's what. What we certainly need is at least a documented set of basic functionality that is demonstrated to work the same everywhere. It looks like there is a lot of stuff that does work, but there's a lot of failure/incompatibility mixed in too.
andyWed 17 Jun 2009
There are outstanding issues with the .NET regex - primarily issues with grouping if I remember. So that may be contributing to some of those issues - haven't had a change to dig in to see for sure yet though.
KevinKelleyThu 18 Jun 2009
Sounds likely... the .net errors start with testGroups which I didn't change, and I used the verifyGroups utility method for quite a few of my tests.
Looking at the .net docs, they say they number groups in innermost-leftmost-first order, which I think is Microsoft being cute; Java seems to follow Perl and every other implementation in numbering capture groups by the order of occurrence of the open paren, which I guess would be outermost-leftmost-first order.
Haven't actually verified this is what's going on, but that's how I read it.
Couple days, I'll try to test it out, if nobody beats me to it.
KevinKelleyMon 22 Jun 2009
To follow up, looks like I was wrong above -- group numbering for RE subgroups does seem to work the same in Java and .NET. The failures are coming from a couple inconsistencies -- with initialization, and with what happens on failed matches.
In Java, the data from a RegexMatcher isn't valid until you call find() or matches().
In .NET, when you get a regex matcher (Match class), it comes pre-initialized. So find() was implemented as a no-op in .NET; that's fine for the first match but leaves you without a way to advance through the string.
For testing I changed the .NET to use a lazy initializer, but really I think the Java side ought to initialize itself (with a find() call in the constructor).
Another thing causing .NET to fail is, in Java (and as documented in Fan), you can get different results from the group methods: an exception if the matcher hasn't been initialized or doesn't match anything; or a null if the matcher's okay but this group doesn't match; or a (possibly empty) string if the group does match.
In .NET there's a Status field, and no exception. So it can be rewritten to check status and throw; but is that really best?
Anyway...
There's a number of other failures, that seem to relate to relate to specific features rather than general usage. Those can, I think, just be documented as depending on the runtime. The general usage stuff -- whether or not you have to call find() first; whether exceptions are thrown; etc; needs to be decided and made to work the same on all runtimes.
Any thoughts? Is anybody depending on current behavior at all?
brianMon 22 Jun 2009
We don't use a whole lot of regex in the codebase, so if you have a patch that makes things better with an initial call to find, please email it and I'll take a look.
Although seems like maybe it might be better to use Java's conventions, and fix .NET to match?
KevinKelleyMon 22 Jun 2009
I dislike the Java convention here, because it exposes an object with uninitialized data -- if you forget to call find (or don't know you should) you get exceptions on the group, start, and end methods. That one's bit me several times already.
On the other hand... you don't want to have it looking like the Java convention, but not working like Java does. Also, the semantics (call find, then examine what you found; repeat until find fails) seems to call for it working the way Java does it.
I think if we were to change how it works on Java side, I'd want to change the method to findNext or next or nextMatch, to better express that there's already a first find.
For now I'm just going to tinker with the .NET a little bit more, get it to pass as many tests as I easily can.
JohnDGTue 23 Jun 2009
if you forget to call find (or don't know you should) you get exceptions on the group, start, and end methods.
I discuss these kinds of problems in Part 3 of my Good API Design series. Neither the .NET or Java APIs are unambiguous, both suffer from the design smell of a class acquiring many concerns (mixing the mechanism of finding with the result of finding).
We haven't touched JS regex target but Google did in GWT 2.1 RegExp. It translates API calls to native JS regex calls under the cover, pure Java implementation uses Java's Pattern class. I think this part of Apache 2.0 Licensed code is something we should reuse.
brianSun 28 Nov 2010
At this point I am not concerned with Java vs .NET differences. I am only concerned with differences between Java and JavaScript.
I think we can safely say that JavaScript behavior is what Fantom should use. I say this because adding an alternate RegEx implementation to Java is far easier/more efficient that adding an alternate RegEx implementation to JavaScript (where code overhead is a very, very big deal, not to mention a JS implementation vs native will be way slow).
Or alternatively we could just say RegEx isn't completely portable and just use the default implementation of each platform. We are already sort of doing this for Int/Float for efficiency. I would probably vote for this direction. What does everybody else think? Adding an alternate implementation to Java would seriously bloat things like JarDists.
I am not a RegEx power user, so this is an area where we need some community help. How well is the existing implementation working for everybody?
umanThu 21 Apr 2011
Hi,
I'm late to this discussion, but I created a Regex Support.{xls,html,pdf} file that highlights the differences between .NET, Java and ECMA regex support. It's a cut-down version of the table at http://www.regular-expressions.info/refflavors.html.
I'd like to pass this on the fandev group, so that when crossplatform regex support gets tackled, there's documentation of what differences need to handled. How do I submit the files?
Here's a sample:
Anchors
Feature .NET Java ECMA
^ (start of string/line) YES YES YES
$ (end of string/line) YES YES YES
\A (start of string) YES YES no <-- highlighted in red
brianFri 22 Apr 2011
I'd like to pass this on the fandev group, so that when crossplatform regex support gets tackled, there's documentation of what differences need to handled
When it comes to regex, I'm sort of thinking we don't bother adding a complex layer to guarantee 100% portability, but rather rely on regex engine provided by the platform. I think Java's slavish rule to only supporting a feature if it could be 100% ported to every platform has been a real setback in getting features like chmod, etc into the API. Plus we have already started this direction with the decision to keep Int/Float as true JavaScript Numbers for performance reasons.
How do I submit the files?
If it is simple you can just post diff here. But typically for more complex patches, we do it offline with email.
So let me ask this question? Is there anything missing from current Fantom Regex API that we need to add? If not, then I'd say lets just wrap up what we have for JS and call this resolved (at least for now).
DanielFathFri 22 Apr 2011
Wouldn't relying on different regex engines cause software incompatibility and difficulty debugging?
ahhatemFri 22 Apr 2011
I think that what he meant is that he just want to grow the documentation to show the differences between the platforms so that the reader has a reference of what he need to consider when coding, I do not think he want to enhance the regex engine in anyway...
IMHO, it is not that hard to make it compatible, I could just make a different regex for each platform and check on runtime... that is easy enough... Of course, any enhancements are nice, but not a priority...
umanSat 23 Apr 2011
is that he just want to grow the documentation to show the differences between the platforms so that the reader has a reference of what he need to consider when coding, I do not think he want to enhance the regex engine in anyway
Yes for documentation, but also no in that only by looking at the differences could we understand if something that is necessary is missing. Moreover, we may be able to spot forthcoming changes to Javascript.
For example, if I look at the columns that show YES for .NET and/or Java and "ascii" for ECMA/javascript, there are a few that I could anticipate will be fixed in later revisions of javascript.
Most of the limits on javascript regexes are quite reasonable, to preserve time and space efficiencies (limits on lookahead, forward references and the like). I rarely use some of the more esoteric features, so for me Javascript regexes are fine.
qualidafialThu 16 Feb 2012
Ping. I think this ticket is still open only as a reminder to document the differences in regex support by platform.
I think this ticket is still open only as a reminder to document the differences in regex support by platform
Actually I've left it open because I don't think RegexMatcher is implemented in JavaScript. At this point, I'm going to leave this ticket open until we implement the JavaScript implementation of the existing API we have working in Java.
This would be an ideal project for someone in the community to send us a patch if you are interested in pitching in.
SlimerDudeFri 9 Dec 2016
I'm going to leave this ticket open until we implement the JavaScript implementation of the existing API we have working in Java.
brian Mon 8 Jun 2009
We need to finish out the Regex API (and decide if/how we might deal with JVM, JS, and .NET incompatibilities).
brian Mon 8 Jun 2009
Promoted to ticket #630 and assigned to brian
JohnDG Mon 8 Jun 2009
It needs to be platform independent. Otherwise what's the point of Fan being cross-platform compatible???
Cross-platform compatibility implies one of two approaches:
RegExp
DSL could easily do it at compile time along with pattern checking).brian Mon 8 Jun 2009
I guess I'd have see how different they really are first. Are we talking obscure edge cases or substantial differences? Does anyone know?
tompalmer Mon 8 Jun 2009
I agree with JohnDG that the default should be full compatibility (including the same "you can't do thats") across platforms. I think his two options are both worth considering. A third is one that just prohibits features that aren't consistent across all platforms.
I'm sure there are differences between the platforms (vague memories from past experience), but I have not done a full nor recent study of the issues.
Doing a regex engine in Fan (option 1) is my favorite idea at the moment. Maybe there is some nicely licensed, fully featured, easy reading engine out there to base things on. Don't know.
KevinKelley Mon 8 Jun 2009
I think it would be worthwhile to do a regex implementation in Fan; I was scouting around the other day for regex links since I want them for the grammar generator I'm working on (Earley parser in Fan, with user interface).
A few links: Here's an article about implementing Thompson's NFA construction to avoid a nasty edge case that apparently exists in many common regex implementations; there's code in C but I'm not sure it's a production-level implementation; the article refers to a page of links to RE implementations, including Plan 9 grep;
TRE seems like a pretty complete and portable (C I think) implementation of POSIX regex, BSD licensed.
Doing it right seems to be non-trivial, so for the moment I'm tabling the re-implement option, in favor of using Fan's Java/C# wrap until it turns out not to be good enough.
tompalmer Mon 8 Jun 2009
The problem is the backwards-compatibility issue. People would be writing non-portable code meanwhile, and fixing that later may break their code for future versions of Fan.
brian Mon 8 Jun 2009
I think the issue with a pure Fan implementation is that it will likely suck wind in JavaScript versus using the JavaScript VM's native regex engine.
So I'd more inclined to:
I am not personally an expert on regex, so this would be a great project for some one else to step up and take ownership.
tompalmer Mon 8 Jun 2009
Forgot about that issue. Good point.
KevinKelley Mon 8 Jun 2009
Digging into some docs...
Java 1.6 doc claims level 1 compliance with UnicodeTS#18, and looks to me pretty extensive and complete.
C# (Windows SDK .NET framework 3.5) Regex looks very similar, but I can't find a standards compliance statement.
Don't know about JS.
Both the Java and C# go far beyond what I would do, if I had to re-implement.
Both use a similar scheme: a
Regex
class, initialized with a string regex, compiles the regex into a state machine. AMatcher
class is then used to match the regex against an input string.Fan follows that scheme, with a minimal
Regex
andMatcher
that wrap the underlying implementations.That's the hard part; the meat of regex's is the language they understand. Without parsing the regex initialization string, I don't see how to support or not support anything.
On the positive side: regex's have been around for 50 years or so; both Java and C# are fairly mature; at least until you get very very deep in esoterica, it seems likely that there won't be a problem.
My suggestion is just to say, Fan supports regular expressions as implemented by the runtime you're using.
Collecting a bunch of test cases, representing as many as possible of the various features of regex's, seems like a good idea.
brian Mon 8 Jun 2009
I agree, I think to proceed:
Like I said, I would really like someone else to take on this project - any volunteers?
tool69 Mon 8 Jun 2009
I agree with Kevin,
ie just using them inside the runtime you're in. That's simpler, and if we have a new runtime, we'll have less problems to solve.
I've bookmarked these blog posts about regex implementation (in Factor), they're interesting.
Also, has anyone started something like a Parsing Expression Grammar Lib in Fan ?
tompalmer Mon 8 Jun 2009
Another option is find compliant implementations of ECMAScript regexes in Java and C#. It's a cheating technique that might do the job for now.
tcolar Mon 8 Jun 2009
I don't think you want to write your own ... this is pretty tricky (and not very fun) stuff.
Javascript and Java have similar impl. of regexp i believe, and it seem .net is close as well.
Couldn't you just write a Fan "decorator" for those implementations ?
I have a few java projects that make heavy use of regexp (one of them is a wiki, kinda like sidewalk), the parser part was made to be callable standalone to do templating stuff, that i have some test for. So whatever implementation you end-up with i could probably run that using the fan parser as some test case.
JohnDG Mon 8 Jun 2009
That would mean you can take a cross-platform pod, switch from one platform to another, and suddenly your code fails due to unknown reasons.
Regular expressions need to be cross-platform. I'd second tompalmer's suggestion: find a Java/C# implementations of JavaScript's RegExp class (Rhino for Java, not sure about .NET). Thus JavaScript's regular expression grammar will become the defining grammar for Fan.
This is indeed possible. You just need to parse the regexp and convert it to the native format (very little would need be done, it's mainly the parsing). It's the "best" solution but not one finding favor here.
KevinKelley Mon 8 Jun 2009
Not a PEG but I'm about to release a CFG parser library, in Fan, with a grammar-editor gui (also in Fan).
PEGs sound interesting -- might look at that down the road.
jodastephen Mon 8 Jun 2009
I think that remaining as is, wrapping the Java and C# APIs will be perfectly fine for the vast majority of use cases. Has anyone come up with a real use-case where there is a difference that would cause problems?
It would be easy enough to add a new pod independent of the core at a later date for a Regex that is written in Fan, but for now this seems to be chasing after an edge case.
tompalmer Tue 9 Jun 2009
I did a super quick review. It seems like Java and .NET have named character classes (like "\p{Lu}") but not always with the same names, and I don't see this feature in ECMAScript.
Also, it seems .NET has some features like comments and named captures that I don't see in Java or ECMAScript.
Could be I'm mistaken about the above, however, and I might have missed other differences.
brian Tue 9 Jun 2009
Any volunteers to take ownership of this feature?
tompalmer Tue 9 Jun 2009
John and I are the ones who seem to care, so it might have to be one of us. But I can't commit the time for anything major. I did just put up a blog post with a call for help on the subject.
My current thought is that if ECMAScript 3 regexes are a proper subset of both Java and .NET, then we'll be able to say, "stick to ECMAScript 3 standardized features, and you'll be safe". If you are willing to commit to that for future platforms (if any). Alternatively, the slightly looser guarantee is, "uses the underlying platform, but as of today that means you are safe if you stick to ECMAScript 3 features".
qualidafial Tue 9 Jun 2009
My feeling on this is that we should just use the underlying regex support of the platform. However the current state of documentation in
sys::Regex
is virtually nonexistent wrt what patterns are supported. The fandoc should document each pattern that is known to be supported on all platforms. That way if anybody uses an undocumented pattern (say .NET's named captures), then they are on their own with regard to portability.What I wonder is whether there is an unencumbered regex specification which we can use in Fan's regex docs.
tompalmer Tue 9 Jun 2009
I haven't checked, but my first guess is that we won't be able to steal the text of the ECMAScript spec.
KevinKelley Tue 9 Jun 2009
I think a good place to start is the Java docs in the
summary of regular expression constructs
section.I'd like to see a similar summary in the Fandoc, and it would be nice to have a set of test cases for everything in that summary; first so as to verify that they actually work on the Java platform, then to test whether they also work on C# and JS.
qualidafial Tue 9 Jun 2009
There's a regex cheatsheat here that embeds lots of (presumably javascript) assert expressions that seem easy to port. It might be possible to get permission from them to use these asserts as a starting point for tests in Fan.
KevinKelley Wed 10 Jun 2009
I won't admit to ownership, but since I'm in the middle of Fan regexes anyway I've started to add some tests to testSys/RegexTest.fan to exercise the regex features list from the javadocs.
When I get more or less through the list (or get sidetracked) I'll pass it on, so they can be added in; then fant can let us know if a runtime doesn't support a feature.
There's actually a pretty long list of features in the Java and the C# regexes (not sure about JS/Ecmascript).
brian Wed 10 Jun 2009
tests would be great!
KevinKelley Mon 15 Jun 2009
Okay, I've about had my fill of regexes now. :-) I've added tests to RegexTest.fan for nearly all the features listed in in the java language documentation, as discussed above.
Number of verifies has gone up from 102 to 962. All pass, when running on a java 1.6 VM.
These are mainly
existence tests
, not fully exercising the functionality but verifying that each feature exists, is recognized, and works for at least a simple test case.There are new tests for
There's a section for
Special Constructs
that I don't understand and didn't do.brian Wed 17 Jun 2009
Kevin, thank you very much for your patch - I pushed into hg.
At first glance it would appear we have a lot of problems with portability since many of the tests you added fail in .NET:
KevinKelley Wed 17 Jun 2009
Some of those fails surprise me -- the
Groups
,SpecialChars
,Classes
and maybePosixClasses
I'd have guessed would be okay, just from skimming the Windows docs.It's hard to see from this where the problem is; it looks like the line numbers in the stack traces are pointing to the end of the method that failed, instead of to the failing line in the method.
I guess the next thing is to sit down with both runtimes and narrow down (and document!) what the incompatibilities are.
I'll try to spend some time with it, maybe over the weekend, see if we can get a clearer picture of what's what. What we certainly need is at least a documented set of basic functionality that is demonstrated to work the same everywhere. It looks like there is a lot of stuff that does work, but there's a lot of failure/incompatibility mixed in too.
andy Wed 17 Jun 2009
There are outstanding issues with the .NET regex - primarily issues with grouping if I remember. So that may be contributing to some of those issues - haven't had a change to dig in to see for sure yet though.
KevinKelley Thu 18 Jun 2009
Sounds likely... the .net errors start with
testGroups
which I didn't change, and I used theverifyGroups
utility method for quite a few of my tests.Looking at the .net docs, they say they number groups in innermost-leftmost-first order, which I think is Microsoft being cute; Java seems to follow Perl and every other implementation in numbering capture groups by the order of occurrence of the open paren, which I guess would be outermost-leftmost-first order.
Haven't actually verified this is what's going on, but that's how I read it.
Couple days, I'll try to test it out, if nobody beats me to it.
KevinKelley Mon 22 Jun 2009
To follow up, looks like I was wrong above -- group numbering for RE subgroups does seem to work the same in Java and .NET. The failures are coming from a couple inconsistencies -- with initialization, and with what happens on failed matches.
In Java, the data from a RegexMatcher isn't valid until you call
find()
ormatches()
.In .NET, when you get a regex matcher (Match class), it comes pre-initialized. So find() was implemented as a no-op in .NET; that's fine for the first match but leaves you without a way to advance through the string.
For testing I changed the .NET to use a lazy initializer, but really I think the Java side ought to initialize itself (with a find() call in the constructor).
Another thing causing .NET to fail is, in Java (and as documented in Fan), you can get different results from the group methods: an exception if the matcher hasn't been initialized or doesn't match anything; or a null if the matcher's okay but this group doesn't match; or a (possibly empty) string if the group does match.
In .NET there's a Status field, and no exception. So it can be rewritten to check status and throw; but is that really best?
Anyway...
There's a number of other failures, that seem to relate to relate to specific features rather than general usage. Those can, I think, just be documented as depending on the runtime. The general usage stuff -- whether or not you have to call
find()
first; whether exceptions are thrown; etc; needs to be decided and made to work the same on all runtimes.Any thoughts? Is anybody depending on current behavior at all?
brian Mon 22 Jun 2009
We don't use a whole lot of regex in the codebase, so if you have a patch that makes things better with an initial call to find, please email it and I'll take a look.
Although seems like maybe it might be better to use Java's conventions, and fix .NET to match?
KevinKelley Mon 22 Jun 2009
I dislike the Java convention here, because it exposes an object with uninitialized data -- if you forget to call find (or don't know you should) you get exceptions on the group, start, and end methods. That one's bit me several times already.
On the other hand... you don't want to have it looking like the Java convention, but not working like Java does. Also, the semantics (call
find
, then examine what you found; repeat until find fails) seems to call for it working the way Java does it.I think if we were to change how it works on Java side, I'd want to change the method to
findNext
ornext
ornextMatch
, to better express that there's already a first find.For now I'm just going to tinker with the .NET a little bit more, get it to pass as many tests as I easily can.
JohnDG Tue 23 Jun 2009
I discuss these kinds of problems in Part 3 of my Good API Design series. Neither the .NET or Java APIs are unambiguous, both suffer from the design smell of a class acquiring many concerns (mixing the mechanism of finding with the result of finding).
katox Sat 27 Nov 2010
I found an incompatiblity list of Java and .NET regex on StackOverflow.
We haven't touched JS regex target but Google did in GWT 2.1 RegExp. It translates API calls to native JS regex calls under the cover, pure Java implementation uses Java's Pattern class. I think this part of Apache 2.0 Licensed code is something we should reuse.
brian Sun 28 Nov 2010
At this point I am not concerned with Java vs .NET differences. I am only concerned with differences between Java and JavaScript.
I think we can safely say that JavaScript behavior is what Fantom should use. I say this because adding an alternate RegEx implementation to Java is far easier/more efficient that adding an alternate RegEx implementation to JavaScript (where code overhead is a very, very big deal, not to mention a JS implementation vs native will be way slow).
Or alternatively we could just say RegEx isn't completely portable and just use the default implementation of each platform. We are already sort of doing this for Int/Float for efficiency. I would probably vote for this direction. What does everybody else think? Adding an alternate implementation to Java would seriously bloat things like JarDists.
I am not a RegEx power user, so this is an area where we need some community help. How well is the existing implementation working for everybody?
uman Thu 21 Apr 2011
Hi,
I'm late to this discussion, but I created a Regex Support.{xls,html,pdf} file that highlights the differences between .NET, Java and ECMA regex support. It's a cut-down version of the table at http://www.regular-expressions.info/refflavors.html.
I'd like to pass this on the fandev group, so that when crossplatform regex support gets tackled, there's documentation of what differences need to handled. How do I submit the files?
Here's a sample:
brian Fri 22 Apr 2011
When it comes to regex, I'm sort of thinking we don't bother adding a complex layer to guarantee 100% portability, but rather rely on regex engine provided by the platform. I think Java's slavish rule to only supporting a feature if it could be 100% ported to every platform has been a real setback in getting features like chmod, etc into the API. Plus we have already started this direction with the decision to keep Int/Float as true JavaScript Numbers for performance reasons.
If it is simple you can just post diff here. But typically for more complex patches, we do it offline with email.
So let me ask this question? Is there anything missing from current Fantom Regex API that we need to add? If not, then I'd say lets just wrap up what we have for JS and call this resolved (at least for now).
DanielFath Fri 22 Apr 2011
Wouldn't relying on different regex engines cause software incompatibility and difficulty debugging?
ahhatem Fri 22 Apr 2011
I think that what he meant is that he just want to grow the documentation to show the differences between the platforms so that the reader has a reference of what he need to consider when coding, I do not think he want to enhance the regex engine in anyway...
IMHO, it is not that hard to make it compatible, I could just make a different regex for each platform and check on runtime... that is easy enough... Of course, any enhancements are nice, but not a priority...
uman Sat 23 Apr 2011
Yes for documentation, but also no in that only by looking at the differences could we understand if something that is necessary is missing. Moreover, we may be able to spot forthcoming changes to Javascript.
For example, if I look at the columns that show YES for .NET and/or Java and "ascii" for ECMA/javascript, there are a few that I could anticipate will be fixed in later revisions of javascript.
Most of the limits on javascript regexes are quite reasonable, to preserve time and space efficiencies (limits on lookahead, forward references and the like). I rarely use some of the more esoteric features, so for me Javascript regexes are fine.
qualidafial Thu 16 Feb 2012
Ping. I think this ticket is still open only as a reminder to document the differences in regex support by platform.
uman has already put together a list of differences based on http://www.regular-expressions.info/refflavors.html so we could probably just take advantage of his effort there and close out this ticket.
brian Sat 25 Feb 2012
Actually I've left it open because I don't think RegexMatcher is implemented in JavaScript. At this point, I'm going to leave this ticket open until we implement the JavaScript implementation of the existing API we have working in Java.
This would be an ideal project for someone in the community to send us a patch if you are interested in pitching in.
SlimerDude Fri 9 Dec 2016
Given a javascript implementation was given in regex matcher implementation for javascript, can this ticket be closed?