Regular Expression Problems

Buffy · September 25, 2008

Regular expressions are really cool gizmos. Unfortunately, they can be bizarre and counterintuitive. So I thought a thread where people could post such interesting challenges and solutions could be fun.

Full disclosure: I'm starting this because I stumbled across a problem made harder by the fact that I'm really only conversant in "old" REs, not the newfangled RE standard that has grown up and is included in so many programming tools today. So I'm no expert!

Here's my problem: I'm parsing HTML files, and I want to "eat" (gosh, I tell you I like compilers because so much of the terminology is based on *food*! I put "tokens" in the finite-state-vending-machine and can "consume" the output! :) ), the HTML comments:

Hey! No problem matching that:

"<!--.*-->"

Well, no, because if you have:

<!-- one --><br/><!-- two -->

The asterisk wild card (*) "matches as much as possible" and will eat the useful tag embedded between the two comments! Yuck!
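The greedy behaviour is easy to reproduce outside lex. Here is a minimal sketch using Python's `re` module rather than lex; the sample string and the variable names are made up for illustration:

```python
import re

# Hypothetical sample: a useful tag sandwiched between two comments.
html = "<!-- first --><br/><!-- second -->"

# Greedy: ".*" grabs as much as it can, so the match only ends
# at the LAST "-->" it can reach.
greedy = re.compile(r"<!--.*-->")

swallowed = greedy.search(html).group(0)
leftover = greedy.sub("", html)

print(swallowed)       # the single match, <br/> included
print(repr(leftover))  # what remains after stripping "comments"
```

The single match covers the whole string, so stripping it throws the `<br/>` away along with both comments.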

So the question here is, what is the regular expression that will match all of the comment but terminate at the *first* instance of the close comment string?

A side note: I'm doing this in lex, so I already have a way to do this programmatically, but that still leaves the interesting RE problem, which, who knows, I might just need some day! :)

The gain I seek is, quiet in the match, :)

Buffy

alexander · September 25, 2008

regular expressions are really meant to match any instance of what you are looking for. in short, you can certainly strip that whole thing of comments without problems. Let me look at it in the morning. what are you using as the program/language to actually run this? (i find implementations of regular expressions to be faulty more often than the expressions themselves. as an example, take the postini regular expression match engine: in my experience it does not do regular expressions right at all.)

Buffy · September 25, 2008

Just to clarify: I am using a version of lex, and all of them pretty much conform to the "old, pre-standard" RE syntax that is common in Unix utilities. The one I'm using actually is a PD implementation that produces C# code (gplex), and thus has the modern standards-based (!) implementation of RE's that's in Microsoft's CLR library.

The "* matches as much as possible" is actually specified in the lex specification, and the solution in lex is simply to break up the parsing using states:

"<!--"              BEGIN INCOMMENT;
<INCOMMENT>.        commentstr+=yytext;
<INCOMMENT>"-->"    BEGIN 0; yylval=commentstr; return COMMENTTOKEN;

...which works like a charm, but it's not a single RE! :)
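For anyone who doesn't read lex, the same two-state idea can be sketched as a hand-rolled scanner in Python. This is only an approximation of what the rules above do: the function name `scan_comments` is hypothetical, and resetting `commentstr` at each comment start is an assumption that the rules as posted don't show.

```python
def scan_comments(text):
    """Collect comment bodies using two states, like INITIAL / INCOMMENT."""
    comments = []
    in_comment = False
    commentstr = ""
    i = 0
    while i < len(text):
        if not in_comment and text.startswith("<!--", i):
            in_comment = True             # BEGIN INCOMMENT
            commentstr = ""               # assumption: start each comment fresh
            i += 4
        elif in_comment and text.startswith("-->", i):
            in_comment = False            # BEGIN 0
            comments.append(commentstr)   # stands in for "return COMMENTTOKEN"
            i += 3
        else:
            if in_comment:
                commentstr += text[i]     # <INCOMMENT>.  (one char at a time)
            i += 1
    return comments


found = scan_comments("<!-- first --><br/><!-- second -->")
print(found)
```

Because the close string is tested before any single character is consumed, the scanner leaves the comment at the first `-->` it meets, which is the behaviour the single RE is supposed to reproduce.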

Nerds don't just happen to dress informally. They do it too consistently. Consciously or not, they dress informally as a prophylactic measure against stupidity, :)

Buffy

Tormod · September 25, 2008

Two seconds of browsing (just to see if I can help) came up with this:

Finding Comments in HTML Source Code Using Regular Expressions

No idea if it gives you any help. :)

alexander · September 25, 2008

buffy, try putting a question mark (makes the quantifier not greedy) after the quantifier:

"<!--.*?-->"
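A quick way to check the lazy-quantifier idea is Python's `re` module, which supports `*?`. This is a sketch, not lex: the sample string is invented, and `re.DOTALL` is an extra assumption added so that a comment may span several lines.

```python
import re

# Hypothetical sample: a useful tag sandwiched between two comments.
html = "<!-- first --><br/><!-- second -->"

# Lazy: ".*?" takes as little as possible, so each match stops
# at the FIRST "-->" that follows its "<!--".
lazy = re.compile(r"<!--.*?-->", re.DOTALL)

comments = lazy.findall(html)
stripped = lazy.sub("", html)

print(comments)
print(stripped)
```

Each comment is now matched separately and the `<br/>` between them survives. Whether a given lex accepts `*?` depends on the implementation; the old-style syntax generally does not have it.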

or you can specify the text that should not be matched like this:
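One pattern along those lines, not necessarily the one originally posted, spells out what the comment body may contain so that it can never run through a `-->`. It is shown here as a Python sketch with invented sample text; the `(?:...)` grouping is Python syntax, and an old-style tool would presumably use plain parentheses instead.

```python
import re

# The body is built from pieces that can never contain "-->":
#   [^-]        any character that is not a dash
#   -[^-]       a lone dash followed by a non-dash
#   --+[^->]    a run of dashes followed by neither "-" nor ">"
# The comment then closes on two or more dashes followed by ">".
explicit = re.compile(r"<!--(?:[^-]|-[^-]|--+[^->])*--+>")

# Hypothetical sample with stray dashes inside the comments.
html = "<!-- a - b --><br/><!-- c ---- d --->"

comments = explicit.findall(html)
stripped = explicit.sub("", html)

print(comments)
print(stripped)
```

Since this uses only character classes, alternation, and plain `*` and `+`, it does not rely on lazy quantifiers at all, at the price of being much harder to read.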
