Kent (posted May 28, 2006):

I have an input file that contains words. I am supposed to scan this text and save each word into a data structure (not part of my question). My question is: how do I get the words but ignore the symbols? The text contains symbols like - , : ; ( ) _ + and so on. Here is what I have so far:

    while (fscanf(fpname, "%s", stun) != EOF) {
        /* ... */
        pname = AllocateName(stun);
        /* ... */
    }

    char *AllocateName(char *stun)
    {
        char *name;
        char let;
        int num;

        num = strlen(stun);
        --num;
        let = stun[num];
        if (let == '.' || let == ',' || let == ':') {
            stun[num] = '\0';   /* drop the trailing punctuation mark */
        }
        if (!(name = (char *)calloc(strlen(stun) + 1, sizeof(char)))) {
            printf("problem allocating name\n");
            exit(2);
        }
        strcpy(name, stun);     /* copy the cleaned word into the new block */
        return name;
    }

Yes, yes, it only handles words that end with a comma, a period, or a colon, so it is not a perfect solution. There can be words like (log)(base4)(12), which is considered one word, but not utter-most-hatred-this-lab, where "utter", "most", "hatred", "this", and "lab" are considered individual words, without the '-'. In other words, if I save "utter-most-hatred-this-lab" as a string in stun, then it must be broken up into some unknown number of pieces, each individually allocated on the heap.

Example of the input text:

    ..........data are used, consisting ......
    .................so-called B-trees.............
    ...........nodes (leaves)...........in O(log(n)) time...........
    ...........(Used in internet routers.)...........

The "O(log(n))" is considered to be one word. I am not sure how I should proceed.
Turtle (posted May 28, 2006):

Kent wrote: "I have this input file that contains words. ... My question is: how do I get the words but ignore the symbols? ... The "O(log(n))" is considered to be one word. I am not sure how I should proceed."

I have no evaluation of your code; however, in your example no word begins with a symbol. I think you meant 'parsing', not 'pausing', yes? :eek:
Kent (posted May 28, 2006):

Turtle wrote: "I think you meant 'parsing', not 'pausing', yes?"

Yes. Do you have any idea how I should proceed? How can I process the input string so that "O(log(n))" is kept as one word, while in "(Used in internet routers.)" each word inside the parentheses is created as an independent string?
Turtle (posted May 28, 2006):

Kent wrote: "Do you have any idea how I should proceed?"

Perhaps by referring to a library of the specified symbols and then specifying that what follows them begins a word, unless it too is a symbol? :eek:

PS: That only takes care of the beginning of a word!
Kent (posted May 28, 2006):

What language is this? I know C and a bit of C++ (I am taking it this quarter). The lab I am doing is data structures using C.
Turtle (posted May 28, 2006):

Also, as I see no contractions, you can make the rule that a word contains no spaces, no "-", no ".", etc., but may contain parentheses. Hope I help more than hinder. :cup: :eek:
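Turtle's rule can be sketched in C roughly as follows. This is only an illustration: the function name next_word and the exact delimiter set are assumptions, not something given in the thread. Parentheses are left out of the delimiter set on purpose, so a token such as O(log(n)) comes back whole.

```c
#include <stddef.h>
#include <string.h>

/* Assumed delimiter set: characters that always end a word.
   Parentheses are deliberately absent, so "O(log(n))" stays in one piece. */
static const char DELIMS[] = " \t\r\n-.,:;_+";

/* Copies the next word found at *cursor into out (at most outsize - 1
   characters, outsize must be at least 1), advances *cursor past the word,
   and returns 1. Returns 0 when no word is left. */
int next_word(const char **cursor, char *out, size_t outsize)
{
    const char *p = *cursor + strspn(*cursor, DELIMS); /* skip delimiters */
    size_t len = strcspn(p, DELIMS);                   /* length of the word */

    if (len == 0)
        return 0;            /* only delimiters (or nothing) remained */
    *cursor = p + len;
    if (len >= outsize)
        len = outsize - 1;   /* truncate words that do not fit */
    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}
```

Calling next_word repeatedly on "so-called B-trees, in O(log(n)) time." yields so, called, B, trees, in, O(log(n)), and time. A token such as "(Used" still keeps its parenthesis under this rule, so it would need a separate pass.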
Kent (posted May 28, 2006):

One way is to scanf a string into "stun" and then look for open parentheses. When I find one, I push it onto a stack. When I spot a closing parenthesis, I pop from the stack. If the stack is empty at the end, I treat the whole of stun as one string. If the stack is not empty, I need to keep only the letters and discard the open and close parentheses.
Qfwfq (posted May 29, 2006):

If, by word, you mean consecutive alpha characters, wouldn't it be enough to cycle through the text using a function such as:

    int isalpha(char c) { return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }

to find where a word starts and where it ends? Of course, if you want "one word" to be one word you can accommodate that too, but I don't see how you could work around cases where something starts with a single " unless you can suppose that it's the last " in the text.
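A minimal sketch of that scan, using the standard isalpha from <ctype.h> rather than a hand-written test. The helper name next_alpha_run is hypothetical. Note that this approach treats every non-letter as a break, so O(log(n)) would come out as O, log, and n.

```c
#include <ctype.h>
#include <stddef.h>

/* Finds the next run of alphabetic characters at or after s.
   Stores the run's length in *len and returns a pointer to its first
   character, or returns NULL when no letters remain. */
const char *next_alpha_run(const char *s, size_t *len)
{
    /* skip everything that is not a letter */
    while (*s != '\0' && !isalpha((unsigned char)*s))
        s++;
    if (*s == '\0')
        return NULL;

    const char *start = s;
    while (isalpha((unsigned char)*s))   /* stops at '\0' as well */
        s++;
    *len = (size_t)(s - start);
    return start;
}
```

The cast to unsigned char is there because passing a negative char value to isalpha is undefined behavior.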
Qfwfq (posted May 29, 2006):

Wait, :) I can see now that it isn't so simple. Sorry, but I had a bit of trouble with the clarity of your posts. So, you want O(log(n)) to be handled as one word, but not (Used in internet routers.) or happy-go-lucky. How about O(log(n-m))? Perhaps Turtle is right: if there's a space before the '(' then it isn't part of a single word. Wouldn't it be enough to have a simple count, rather than a stack, incrementing at '(' and decrementing at ')', so as to know when things like (log)(base4)(12) or O(log(n-m)) have ended?
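The counter idea might look like this in C. It is a sketch only, and the name parens_balanced is made up for illustration. Since only one kind of bracket is involved, a single integer carries the same information a stack would.

```c
/* Returns 1 if every ')' in token closes an earlier '(' and no '('
   is left open at the end; returns 0 otherwise. */
int parens_balanced(const char *token)
{
    int depth = 0;   /* number of '(' currently open */

    for (; *token != '\0'; token++) {
        if (*token == '(') {
            depth++;
        } else if (*token == ')') {
            if (depth == 0)
                return 0;   /* a ')' with nothing to close */
            depth--;
        }
    }
    return depth == 0;
}
```

With this check, O(log(n)) and (log)(base4)(12) are balanced, while whitespace-separated tokens such as "(Used" and "routers.)" are not, which is one way to tell the two cases apart.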
Kent (posted May 29, 2006):

Qfwfq wrote: "So, you want O(log(n)) to be handled as one word, but not (Used in internet routers.) or happy-go-lucky. How about O(log(n-m))? Perhaps Turtle is right: if there's a space before the '(' then it isn't part of a single word."

By design, there is no word like O(log(n-m)), nor any space between the '(' and the next character.

Qfwfq wrote: "Wouldn't it be enough to have a simple count, rather than a stack, incrementing at '(' and decrementing at ')', so as to know when things like (log)(base4)(12) or O(log(n-m)) have ended?"

By design, things like (log)(base4)(12) are considered a single word.
Southtown (posted May 29, 2006):

Feel free to check my JavaScript word counter: http://st10.startlogic.com/~thedawgs/mostuff/southie/extras/CountWords.html

The "Count Words" button does just that. The "Clean Text" button only deletes hard returns to allow natural word wrapping and limits returns between paragraphs to two. It doesn't flag special characters, but it does differentiate in a way you can utilize: ASCII codes. Here's a taste.

    // --- validate at least one character
    for ( i = 0; i < text.length; i++ ) {
        if ( text.charCodeAt( i ) > 32 ) {
            // --- count first word
            wordCount++;
            break;
        }
    }

So say you were going to pipe each word in a string into individual strings or an array, you would do it kinda like:

    initialize vars
    loop through string
    --- if charCode > 64 && < 91 || > 96 && < 123 (alpha chars only)
    ------ pipe me into var (put sequential alpha chars into string)
    --- else (we have a non-alpha char)
    ------ pipe var into array and clear var (save string without this and start over)
    repeat

You could be more specific, of course, and look for individual chars such as spaces, or increment a counter, or whatever.
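In C, the pseudocode above could be sketched as follows, with each word copied into its own heap block as the original question asks. The function name split_alpha_words and its ownership rules are assumptions made for this example, and isalpha from <ctype.h> stands in for the explicit character-code ranges.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Splits text into runs of letters, each copied into its own heap block.
   Returns a malloc'd array of *count strings. Returns NULL with *count
   set to 0 if text has no letters or an allocation fails.
   The caller frees each string and then the array itself. */
char **split_alpha_words(const char *text, size_t *count)
{
    char **words = NULL;
    size_t n = 0;

    *count = 0;
    while (*text != '\0') {
        /* skip anything that is not a letter */
        while (*text != '\0' && !isalpha((unsigned char)*text))
            text++;
        const char *start = text;
        while (isalpha((unsigned char)*text))
            text++;
        size_t len = (size_t)(text - start);
        if (len == 0)
            break;                       /* reached the end of the text */

        char *word = malloc(len + 1);
        char **grown = realloc(words, (n + 1) * sizeof *words);
        if (grown != NULL)
            words = grown;
        if (word == NULL || grown == NULL) {
            /* allocation failed: release everything built so far */
            free(word);
            while (n > 0)
                free(words[--n]);
            free(words);
            return NULL;
        }
        memcpy(word, start, len);
        word[len] = '\0';
        words[n++] = word;
    }
    *count = n;
    return words;
}
```

For "utter-most-hatred-this-lab" this produces five separately allocated strings: utter, most, hatred, this, and lab. Growing the array one slot at a time keeps the sketch short; a real program would probably grow it in larger steps.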
Kent (posted May 29, 2006):

If there is a word like (home then the '(' is deleted, and home is the word to be allocated. If it is (home), then (home) is one word. Here is my thought (in C). Let ary be the string with everything in it. Here is my pseudocode:

    1) set ptr to ary
    2) loop (*ptr not equal to '\0')
       2.1) if *ptr is '(', push it onto a stack
       2.2) if *ptr is ')', pop the matching '(' from the stack
       2.3) increment ptr by 1
    3) end loop
    4) if (stack is empty)
       4.1) set det to ary
       4.2) loop (*det != '\0')
            4.2.1) if (isalpha(*det) || *det == '(')
            4.2.2)     strung[num] = *det
            4.2.3)     increment num
            4.2.4) end if
            4.2.5) increment det
       4.3) end loop
    5) end if

Is this a viable way? Is there a better way to do this?
Southtown (posted May 29, 2006):

Oh, sorry. I misunderstood your situation. You could go the long route and attempt to ascertain the purpose of each '(' or ')'. Or you could just flag the 'delimiters', the characters that will always constitute word breaks, such as spaces. My counter, for example, only counts spaces and carriage returns. It would be just as easy, though, to count multiple delimiters at the same time, like spaces, hyphens, commas, and periods. If a character such as a parenthesis does not always act as a delimiter, though, ignore it. The counter would then count ((x)(base10)log(y)) as one word and (this phrase) as two words because of the space alone.

Just be careful not to over-increment. I'm no pro, but I would set an incrementor and a flag: wordCount/doCount or similar. (I hate little names, they confuse me.) You would initialize both of these and then create a flip-flop situation.

    wordCount = 0;
    doCount = true;
    loop through string
        if (char == alphabetical) {
            if (doCount == true) {
                wordCount++;
                doCount = false;
            }
        } else if (char == delimiter) { // but ignore parentheses
            doCount = true;
        }
    end loop

This will catch double counting. You just need to specify your alpha characters and delimiters to define words and word breaks.

    wordCount = 0;
    doCount = true;
    ptr = 0;
    // charCodeAt returns NaN past the end of the string, which ends the loop
    while ( code = ary.charCodeAt(ptr) ) {
        if ( (code > 64 && code < 91) ||     // between 'A' and 'Z' or
             (code > 96 && code < 123) ) {   // between 'a' and 'z'
            if ( doCount ) {
                wordCount++;
                doCount = false;
            }
        } else if ( code == 32 ||   // space or
                    code == 44 ||   // comma or
                    code == 45 ||   // hyphen or
                    code == 46 ||   // period or
                    code == 58 ) {  // colon
            // and so on (ignore parentheses)
            doCount = true;
        }
        ptr++;
    }

This is JavaScript, though. I don't know C very well yet, but the same logic would apply.
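For comparison, the same flip-flop counter might be written in C like this. It is a sketch under the same assumptions as the JavaScript: the delimiter list is only an example, parentheses and digits are ignored, and the name count_words is made up here.

```c
#include <ctype.h>
#include <string.h>

/* Counts words in text. A word starts at the first letter seen after a
   delimiter (or at the start of the text). Parentheses, digits, and other
   non-delimiter symbols neither start nor end a word. */
int count_words(const char *text)
{
    int word_count = 0;
    int do_count = 1;   /* 1 while waiting for the first letter of a word */

    for (; *text != '\0'; text++) {
        unsigned char c = (unsigned char)*text;

        if (isalpha(c)) {
            if (do_count) {
                word_count++;
                do_count = 0;
            }
        } else if (isspace(c) || strchr("-,.:;", c) != NULL) {
            do_count = 1;   /* a delimiter re-arms the counter */
        }
    }
    return word_count;
}
```

As described in the post, ((x)(base10)log(y)) counts as one word and (this phrase) as two.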
Southtown (posted May 31, 2006):

Sorry, I misunderstood again. I'm slow. You're not counting words at all, are you? The good news is that in C a char is treated as an integer (a US-ASCII code on most systems), so the same character-code comparisons apply. :phones:
Kent (posted May 31, 2006):

Well, it turns out the solution is ridiculously simple "by design". All I had to do was check the front and back of a string for ( and ).
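Kent did not post the final code, so the following is only one possible reading of "check the front and back", pieced together from the rules stated earlier in the thread: "(home" becomes "home", while "(home)" and "O(log(n))" stay whole. The function name trim_unmatched_parens and the balance-based rule are assumptions.

```c
#include <string.h>

/* Strips unmatched parentheses from the two ends of token, in place.
   Returns a pointer to the first kept character, which lies inside the
   caller's buffer. Matched parentheses are left untouched. */
char *trim_unmatched_parens(char *token)
{
    int opens = 0, closes = 0;

    for (const char *p = token; *p != '\0'; p++) {
        if (*p == '(')
            opens++;
        else if (*p == ')')
            closes++;
    }

    /* "(home" -> "home": drop a leading '(' that has no partner */
    while (*token == '(' && opens > closes) {
        token++;
        opens--;
    }

    /* "home)" -> "home": drop a trailing ')' that has no partner */
    size_t len = strlen(token);
    while (len > 0 && token[len - 1] == ')' && closes > opens) {
        token[--len] = '\0';
        closes--;
    }
    return token;
}
```

Applied to whitespace-separated tokens, this turns "(Used" into "Used" and "routers.)" into "routers.", leaving the trailing period for the punctuation check already present in AllocateName.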