Buffy Posted September 25, 2008

Regular expressions are really cool gizmos. Unfortunately, they can be bizarre and counterintuitive, so I thought a thread where people could post such interesting challenges and solutions could be fun.

Full disclosure: I'm starting this because I stumbled across a problem made harder by the fact that I'm really only conversant in "old" REs, not the newfangled RE standard that has grown up and is included in so many programming tools today. So I'm no expert!

Here's my problem: I'm parsing HTML files, and I want to "eat" the HTML comments. (Gosh, I tell you I like compilers because so much of the terminology is based on *food*! I put "tokens" in the finite-state-vending-machine and can "consume" the output! :) )

<!-- comment here -->

Hey! No problem matching that:

"<!--".*"-->"

Well, no, because if you have:

<!-- comment here --><br/><!-- second comment here -->

the asterisk wildcard (*) "matches as much as possible" and will eat the useful tag embedded between the two comments! Yuck!

So the question here is: what is the regular expression that will match all of the comment but terminate at the *first* instance of the close-comment string?

A side note: I'm doing this in lex, so I already have a way to do this programmatically, but it still leaves an interesting RE problem that, who knows, I might just need some day! :)

The gain I seek is, quiet in the match, :)
Buffy
alexander Posted September 25, 2008

Regular expressions are really meant to match any instance of what you are looking for. In short, you can certainly strip that whole thing of comments without problems.

Let me look at it in the morning. What program/language are you using to actually run this? (I find that implementations of regular expressions are faulty more often than the expressions themselves. As an example, take the Postini regular expression match engine: in my experience, it does not do regular expressions right at all.)
Buffy Posted September 25, 2008 (Author)

Just to clarify: I am using a version of lex, and all of them pretty much conform to the "old, pre-standard" RE syntax that is common in Unix utilities. The one I'm using actually is a PD implementation that produces C# code (gplex), and thus has the modern standards-based (!) implementation of REs that's in Microsoft's CLR library.

The "* matches as much as possible" behavior is actually specified in the lex specification, and the solution in lex is simply to break up the parsing using states:

    "<!--"               BEGIN INCOMMENT;
    <INCOMMENT>.         commentstr += yytext;
    <INCOMMENT>"-->"     BEGIN 0; yylval = commentstr; return COMMENTTOKEN;

...which works like a charm, but it's not a single RE! :)

Nerds don't just happen to dress informally. They do it too consistently. Consciously or not, they dress informally as a prophylactic measure against stupidity, :)
Buffy
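For readers without lex at hand, the start-state idea in the rules above can be sketched in Python. This is only an illustration under stated assumptions, not the gplex scanner from the post: the function name `scan_comments`, the token names `TEXT` and `UNTERMINATED`, and the choice to return a list of tuples are all hypothetical. The scanner switches into an `INCOMMENT` state on `<!--`, collects characters one at a time, and leaves the state at the first `-->` it sees, so it never runs past the first close-comment string.

```python
def scan_comments(text):
    """Return a list of (kind, value) tokens, splitting text on HTML comments.

    Mirrors the two-state lex approach: INITIAL outside a comment,
    INCOMMENT between "<!--" and the first following "-->".
    """
    OPEN, CLOSE = "<!--", "-->"
    tokens = []
    state = "INITIAL"
    buf = []  # characters collected since the last token boundary
    i = 0
    while i < len(text):
        if state == "INITIAL" and text.startswith(OPEN, i):
            if buf:  # flush any ordinary text seen before the comment
                tokens.append(("TEXT", "".join(buf)))
                buf = []
            state = "INCOMMENT"
            i += len(OPEN)
        elif state == "INCOMMENT" and text.startswith(CLOSE, i):
            # First close marker wins, so the comment cannot overrun.
            tokens.append(("COMMENT", "".join(buf)))
            buf = []
            state = "INITIAL"
            i += len(CLOSE)
        else:
            buf.append(text[i])  # unlike lex ".", this also accepts newlines
            i += 1
    if buf:  # leftover text, or a comment that was never closed
        kind = "TEXT" if state == "INITIAL" else "UNTERMINATED"
        tokens.append((kind, "".join(buf)))
    return tokens


if __name__ == "__main__":
    sample = "<!-- comment here --><br/><!-- second comment here -->"
    for token in scan_comments(sample):
        print(token)
```

One difference worth noting: the sketch starts each comment with an empty buffer, whereas the lex rules as posted would need `commentstr` reset somewhere if more than one comment is to be returned separately.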
Tormod Posted September 25, 2008

Two seconds of browsing (just to see if I can help) came up with this: "Finding Comments in HTML Source Code Using Regular Expressions".

No idea if it gives you any help. :)
alexander Posted September 25, 2008

buffy, try putting a question mark after the quantifier (it makes the quantifier not greedy):

"<!--".*?"-->"

or you can specify the text that should not be matched, like this:

<!--[^>]*-->
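The difference between these patterns is easy to check with Python's `re` module, used here only as a convenient stand-in for a modern regex engine. Whether a given lex variant accepts `.*?` is not settled in this thread, so treat that as something to test. Two things below are additions rather than suggestions from the posts: the `re.DOTALL` flag (so that `.` also crosses line breaks), and the `NO_LAZY` pattern, a hand-built alternative in the style of the classic C-comment regex that avoids the lazy quantifier altogether. Note also that the negated-class pattern `<!--[^>]*-->` refuses any comment whose text contains a `>`.

```python
import re

SAMPLE = "<!-- comment here --><br/><!-- second comment here -->"

# Greedy: ".*" runs to the LAST "-->", swallowing the <br/> in between.
GREEDY = re.compile(r"<!--.*-->", re.DOTALL)

# Lazy: ".*?" stops at the FIRST "-->".
LAZY = re.compile(r"<!--.*?-->", re.DOTALL)

# Negated class: stops early too, but cannot match a comment containing ">".
NEGATED = re.compile(r"<!--[^>]*-->")

# No lazy quantifier (an assumption-laden sketch for "old" RE dialects).
# The body is built from pieces that can never contain "-->":
#   [^-]       any non-dash character
#   -[^-]      a lone dash followed by a non-dash
#   --+[^->]   two or more dashes followed by neither "-" nor ">"
# and the comment is closed by two or more dashes and ">".
NO_LAZY = re.compile(r"<!--(?:[^-]|-[^-]|--+[^->])*--+>")

if __name__ == "__main__":
    for name, pattern in [
        ("greedy", GREEDY),
        ("lazy", LAZY),
        ("negated", NEGATED),
        ("no_lazy", NO_LAZY),
    ]:
        print(name, pattern.findall(SAMPLE))

    # A comment containing ">" shows where the negated class falls short.
    tricky = "<!-- if a > b -->"
    print("negated on tricky:", NEGATED.findall(tricky))
    print("lazy on tricky:", LAZY.findall(tricky))
```

On the sample line, the greedy pattern returns a single match covering the whole string, while the other three return the two comments separately.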