Science Forums


Posted

Regular expressions are really cool gizmos. Unfortunately, they can be bizarre and counterintuitive. So I thought a thread where people could post interesting challenges and solutions could be fun.

 

Full disclosure: I'm starting this because I stumbled across a problem made harder by the fact that I'm really only conversant in "old" RE's, not the newfangled RE standard that has grown up and is included in so many programming tools today, so I'm no expert!

 

Here's my problem: I'm parsing HTML files, and I want to "eat" (gosh, I tell you I like compilers because so much of the terminology is based on *food*! I put "tokens" in the finite-state-vending-machine and can "consume" the output! :) ), the HTML comments:

 

<!-- comment here -->

 

Hey! No problem matching that:

 

"<!--".*"-->"

 

Well, no, because if you have:

 

<!-- comment here --><br/><!-- second comment here -->

 

The asterisk (*) is a greedy repetition operator: it "matches as much as possible", so the match runs on to the *last* close-comment string and eats the useful tag sitting between the two comments! Yuck!
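To make the overrun concrete, here is a minimal sketch in Python rather than lex (Python's `re` module is an assumption for illustration only; the variable names are made up). It applies the greedy pattern to the two-comment line above:

```python
import re

html = "<!-- comment here --><br/><!-- second comment here -->"

# Greedy: .* grabs as much as it can, then backs up only as far as
# the LAST "-->" on the line.
greedy = re.search(r"<!--.*-->", html)

# The match is the whole line, with the <br/> tag swallowed inside it.
print(greedy.group(0) == html)   # True
```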

 

So the question here is, what is the regular expression that will match all of the comment but terminate at the *first* instance of the close comment string?
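Two candidate answers, sketched in Python for testing purposes (not lex; the names `LAZY`, `CLASSIC` and `all_matches` are hypothetical). The first relies on the lazy quantifier `*?`, which only the newer RE dialects provide. The second uses nothing but classic operators: the comment body is built from pieces that can never spell out "-->", so the match has to stop at the first close string.

```python
import re

# Newer dialects: *? matches as LITTLE as possible.
LAZY = re.compile(r"<!--.*?-->", re.DOTALL)

# Classic operators only. The body is a run of:
#   a non-dash | one dash then a non-dash | 2+ dashes then neither "-" nor ">"
# and the terminator is 2+ dashes followed by ">".
CLASSIC = re.compile(r"<!--([^-]|-[^-]|--+[^->])*--+>")

def all_matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in pattern.finditer(text)]

html = "<!-- comment here --><br/><!-- second comment here -->"
print(all_matches(LAZY, html))
print(all_matches(CLASSIC, html))
# both print ['<!-- comment here -->', '<!-- second comment here -->']
```

In lex syntax the classic form would need the literal pieces quoted, and whether a given lex accepts it unchanged is untested here.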

 

A side note: I'm doing this in lex, so I already have a way to do this programmatically, but that still leaves an interesting RE problem that, who knows, I might just need to solve some day! :)

 

The gain I seek is, quiet in the match, :)

Buffy

Posted

Regular expressions are really meant to match any instance of what you are looking for. In short, you can certainly strip the comments out of that whole thing without problems. Let me look at it in the morning. What program/language are you using to actually run this? I find implementations of regular expressions to be faulty more often than the expressions themselves. As an example, take the Postini regular expression match engine: in my experience, it does not do regular expressions right at all.

Posted

Just to clarify: I am using a version of lex, and all of them pretty much conform to the "old, pre-standard" RE syntax that is common in Unix utilities. The one I'm using actually is a PD implementation that produces C# code (gplex), and thus has the modern standards-based (!) implementation of RE's that's in Microsoft's CLR library.

 

The "* matches as much as possible" is actually specified in the lex specification, and the solution in lex is simply to break up the parsing using states:

"<!--"              { BEGIN INCOMMENT; commentstr = ""; /* start a fresh comment */ }

<INCOMMENT>(.|\n)   { commentstr += yytext; /* . alone would stop at a newline */ }

<INCOMMENT>"-->"    { BEGIN 0; yylval = commentstr; return COMMENTTOKEN; }

 

...which works like a charm, but it's not a single RE! :)
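For anyone who doesn't read lex, here is a rough Python sketch of what those start states do (the function name `scan_comments` is made up for illustration, and this is an emulation of the idea, not what gplex generates). The scanner sits in an initial state until it sees "<!--", then collects characters one at a time until the first "-->" sends it back:

```python
def scan_comments(text):
    """Return the body of each <!-- ... --> comment, using two scanner states."""
    INITIAL, INCOMMENT = 0, 1
    state = INITIAL
    commentstr = ""
    comments = []
    i = 0
    while i < len(text):
        if state == INITIAL:
            if text.startswith("<!--", i):
                state = INCOMMENT      # like BEGIN INCOMMENT
                commentstr = ""
                i += 4
            else:
                i += 1                 # other rules would handle this input
        else:
            if text.startswith("-->", i):
                comments.append(commentstr)
                state = INITIAL        # like BEGIN 0
                i += 3
            else:
                commentstr += text[i]  # like commentstr += yytext
                i += 1
    return comments

html = "<!-- comment here --><br/><!-- second comment here -->"
print(scan_comments(html))   # [' comment here ', ' second comment here ']
```

Because the close string is checked before each single character is consumed, the scan always stops at the first "-->", which is what the longest-match rule gives the lex version.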

 

Nerds don't just happen to dress informally. They do it too consistently. Consciously or not, they dress informally as a prophylactic measure against stupidity, :)

Buffy
