sanctus, posted October 23, 2007:
How is the following possible?
1) I have a running program with an option, say A, set to false, i.e. A=false;
2) I set A=true; and get an error (it still runs, but it doesn't calculate because it is always out of bounds);
3) I set A=false; again and still get the same error as after step 2.
This is not clear to me at all, because between every numbered step (i.e. between 1 and 2, and between 2 and 3) there is a make clean followed by a make, so it shouldn't know anything about the preceding step, right?
alexander, posted October 23, 2007:
It does not matter how many times you set A, as long as you are not accidentally redefining it (in which case you will get an error). I can, however, see a possible problem with you using A in calculations; I would type-cast it to int just to be on the safe side. I know posting the whole code may not be possible, but could you perhaps post more pieces that may give us a better picture of what is happening?
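In C++, redefining A in the same scope is indeed a compile error, but declaring a variable with the same name in an inner scope is legal and silently shadows the outer one, so an assignment can appear to have no effect. A minimal sketch; the global `A` and both functions are hypothetical stand-ins, not code from sanctus's program:

```cpp
// Hypothetical global option standing in for the program's A.
bool A = false;

// Declares a new A in an inner scope: this compiles (g++ only warns
// when -Wshadow is enabled) and never touches the global.
bool setOptionShadowed() {
    {
        bool A = true;  // shadows ::A
        (void)A;        // silence the "unused variable" warning
    }
    return A;  // refers to the global again, which is still false
}

// Assigns the global itself.
bool setOptionAssigned() {
    A = true;
    return A;
}
```

Compiling with `-Wall -Wshadow` makes g++ point out every such shadowing declaration.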
Buffy, posted October 23, 2007:
This is the kind of bug for which a debugger is pretty much invaluable. Have you got one? You need to either set a break on change of value or step through the code starting where it's "not the right value." I had a developer not too long ago who thought he was such a hotshot that he could get by without using a debugger, just traces. He did not last long!
Step into,
Buffy
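In gdb, the "break on change of value" Buffy describes is a watchpoint (`watch A`). Where a debugger is not available, a rough in-code substitute is to wrap the option in a type that reports every assignment that changes it. A minimal sketch; `Watched` is a hypothetical helper, not part of any library:

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical wrapper that logs every assignment that changes the value,
// similar in spirit to a debugger watchpoint.
template <typename T>
class Watched {
public:
    Watched(std::string name, T initial)
        : name_(std::move(name)), value_(initial) {}

    // Every assignment goes through here, so changes can be logged.
    Watched& operator=(const T& next) {
        if (!(next == value_)) {
            std::cerr << "[watch] " << name_ << ": " << value_
                      << " -> " << next << '\n';
            ++changes_;
        }
        value_ = next;
        return *this;
    }

    // Implicit read access, so existing code like `if (A)` still compiles.
    operator const T&() const { return value_; }

    std::size_t changes() const { return changes_; }

private:
    std::string name_;
    T value_;
    std::size_t changes_ = 0;
};
```

Declaring the option as `Watched<bool> A("A", false);` then prints a line each time A actually flips, which shows whether anything besides the intended assignment touches it.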
LaurieAG, posted October 24, 2007:
Quoting sanctus: "This is not clear at all to me because between every numbered step (i.e. 1 and 2 and 2 and 3) there is a make clean and then a make...so it shouldn't know anything from the preceeding right?"
Hi Sanctus,
It may be due to your 'A' being a restricted system variable name. Your documentation should provide a list of any restricted variable names.
sanctus, posted October 24, 2007:
Actually, the code was not written by me; I just modified some parts (nothing related to this, though). The code is too complicated to post here in a way that would give a clue. 'A' is just the name of an experiment to include or not when computing the likelihood. The problem with gdb is that I run this program on nine different machines at the university, so I don't really know how to use gdb there. Additionally, looking at the output on the different machines, there isn't really a compilation error but an error message written into the code (something like "in spline # x out of bounds"), so it keeps running but calculates nothing because x is out of bounds. But why was x in bounds at the beginning, and then, after changing A from false to true and back to false, out of bounds (as it was when A was true)?
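A message like "in spline # x out of bounds" typically comes from the step where spline interpolation looks up which knot interval contains x and refuses when x lies outside the tabulated range. A minimal sketch of such a check, assuming sorted knots; `findInterval` is a hypothetical name, not the routine in sanctus's code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns i such that knots[i] <= x <= knots[i + 1], or -1 if x lies
// outside [knots.front(), knots.back()]. Assumes knots is sorted ascending.
// The negated comparison also treats NaN as out of bounds.
long findInterval(const std::vector<double>& knots, double x) {
    if (knots.size() < 2 || !(x >= knots.front() && x <= knots.back())) {
        return -1;  // the "x out of bounds" case
    }
    // upper_bound gives the first knot strictly greater than x.
    auto it = std::upper_bound(knots.begin(), knots.end(), x);
    long i = static_cast<long>(it - knots.begin()) - 1;
    // x == knots.back() would give the last knot; clamp into the last interval.
    return std::min(i, static_cast<long>(knots.size()) - 2);
}
```

If x only falls out of range after A is toggled, printing x together with the first and last knots at the failing call shows whether x itself or the table it is checked against has changed.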
Buffy, posted October 24, 2007:
Quoting sanctus: "The problem with gdb is that I run this program on different (9) machines at university, so I don't really know how to use gdb."
Learn it on the one machine that has it installed! You'll be glad you did! Why? Well...
Quoting sanctus: "...there isn't really an error of compilation but one written in the code..."
Debuggers are what you use when it compiles fine but there's something in your algorithm that isn't working, and you have to go in and watch what's going on in order to find the flaw. But if you don't have the time to learn gdb (and really all you need to know is how to set a breakpoint and then inspect the variables that are making you miserable), then you need to "instrument" the app, which basically just means putting "writes" or "alerts" in your code at strategic places to see when x and A are changing. Put these before and after every reference to them in the code that's not working. Quite often when you have problems like the one you're describing, it's due either to hidden side effects (something is buried one level down in a function) or to scoping errors (x is a local variable that gets set in one place, goes out of scope, and then gets recreated but now has no value). If you were using C instead of C++, I'd also be looking for arrays that run past their declared sizes and munge data in other variables allocated nearby on the heap, but you should be getting run-time errors for that (unless you were using an old generic Unix "cc").
printf("It hasn't worked for the last %d loops.", i),
Buffy
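The "writes before and after every reference" that Buffy suggests can be kept to one line per call site with a small macro that records the file, line, expression text and value. A minimal sketch; `TRACE` and `scaleIfEnabled` are hypothetical names, and the multiplication is only a stand-in for the real computation:

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical tracing macro: writes file, line, the expression text and
// its current value to the given stream.
#define TRACE(os, expr)                                              \
    ((os) << __FILE__ << ':' << __LINE__ << ": " << #expr << " = " \
          << (expr) << '\n')

// Example of instrumenting a suspicious step before and after it runs.
double scaleIfEnabled(bool useA, double x, std::ostream& out) {
    TRACE(out, useA);
    TRACE(out, x);       // before
    if (useA) x *= 2.0;  // stand-in for the real computation
    TRACE(out, x);       // after
    return x;
}
```

Passing the stream as a parameter lets each of the nine machines write to its own file (for example an `std::ofstream` named after the host) instead of interleaving everything on one terminal.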
alexander, posted October 24, 2007:
g++ -ggdb -o prog prog.cpp
gdb prog
run
sanctus, posted October 25, 2007:
I know the basics of gdb, so that is not the problem. The problem is that all the machines are far away, and on mine it takes about an hour to get to the point where it starts calculating, so I can't try it on the machine that has gdb installed. Alex, to compile the program there is a makefile, and I added the gdb flag to it (-ggdb3 is my addition):
FLAGS = -ggdb3 -Wall -DQT_THREAD_SUPPORT -DPRERELEASE $(WMAPDEFINE)
It works fine when not running a Monte Carlo simulation, since then I use my machine. I also tried the following. When you are on the other machines via ssh, you usually write
$ qsub -l nodes=9 -l walltime=500:00:00 ./mcrun
to use 9 machines that calculate for 500 hours. So I tried something like this (it calculated, so I guess I did it the right way):
$ gdb qsub -l nodes=9 -l walltime=500:00:00 ./mcrun
$ run
It calculated, but since the program does not give real errors, only the try-catch error messages written by the original coder, using gdb or not doesn't change much. Now (I said I know a little gdb, but maybe I exaggerated), how do I set a breakpoint so that the program stops at a given point (i.e. when the catch error comes)?
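Assuming the original coder's error really is a C++ exception thrown and then caught (sanctus calls it a try-catch error), gdb's `catch throw` command stops the program at the throw site, before the handler runs, and `bt` then shows the call chain that produced the bad value. If the message is merely printed, a `break` on the function that prints it serves the same purpose. The pattern being stopped on looks roughly like this; `tableValue` and `safeLookup` are hypothetical names:

```cpp
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical bounds-checked lookup that throws instead of reading past the end.
double tableValue(const std::vector<double>& table, std::size_t i) {
    if (i >= table.size()) {
        // With `catch throw` set, gdb stops here.
        throw std::out_of_range("index " + std::to_string(i) +
                                " out of bounds (size " +
                                std::to_string(table.size()) + ")");
    }
    return table[i];
}

// Hypothetical caller mimicking a try/catch that reports the error and
// carries on, which is why the program keeps running but computes nothing.
double safeLookup(const std::vector<double>& table, std::size_t i) {
    try {
        return tableValue(table, i);
    } catch (const std::out_of_range& e) {
        std::cerr << "error: " << e.what() << '\n';
        return 0.0;
    }
}
```

By the time the `catch` block prints, the stack of the failing call has already been unwound, which is why stopping at the `throw` is more informative than stopping in the handler.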
sanctus, posted October 25, 2007:
Now this is what I did:
$ gdb mc_general
(gdb) break fastsplint()
Function "fastsplint()" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (fastsplint()) pending.
(gdb) exec qsub -l nodes=9 -l walltime=500:00:00 ./mcrun
(gdb) run
And then it starts, and stops doing anything after:
Starting program: /usr/bin/qsub
Reading symbols from shared object read from target memory...done.
Loaded system supplied DSO at 0xffffe000
Any idea?
Qfwfq, posted October 25, 2007:
No doubt you need to watch the rest of the code, especially if the variable isn't declared locally. An option is probably global, isn't it?
Quoting sanctus: "...and get an error (it still runs but doesn't calculate because always out of bounds)..."
I still don't get your meaning here, but if it's giving you an index out of bounds, it could well be what Buffy says:
Quoting Buffy: "If you were using C instead of C++, I'd also be looking for arrays that are going past their declared sizes and munging data in other variables that get allocated nearby on the heap, but you should be getting run-time errors for that (unless you were using an old generic unix "cc")."
although I disagree that it's a difference between the two languages; it depends on the platform, the implementation, and typically even on compiler options. I've been able to play dirty tricks with index values in C++ too, whereas using pure C on AIX I often get those nasty page faults from having missed something with pointers.
If you can use a debugger on one of the machines, don't forget that you only need to find the cause on one machine (assuming you're not doing odd system-dependent stuff that requires different versions). However, Buffy, forgive me, I don't believe in stepping through before having examined the code a bit and reasoned about it. I don't believe in mental laziness; people today are lost without a pocket calculator to add a couple of two- or three-digit numbers.
Still, it can be tough without a debugger. I've done tricky stuff on AIX without one (no, Buffy, don't fire me, it ain't my fault...). One example: when I was getting page faults, it took me quite a while to find the cause and why the pointer went haywire, and yet the cause was elementary. I had simply forgotten to re-initialize it to the base address after adding a new bit before the place where it was already used. When the problem cropped up, I had forgotten that the new bit used the moving pointer in the iteration. It was enough to add ramo = rami, and the problem was solved.
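The bug Qfwfq describes, a moving pointer that a newly added loop leaves at the end of the buffer, can be reproduced in a few lines. A minimal sketch; `sumTwice`, `base` and `cursor` are hypothetical names, with `base` and `cursor` playing the roles of `rami` and `ramo`:

```cpp
#include <cstddef>

// `base` is the start of the buffer; `cursor` is the pointer that walks it.
long sumTwice(const int* base, std::size_t n) {
    const int* cursor = base;
    long total = 0;
    // First pass: the newly added code that advances the cursor.
    for (std::size_t i = 0; i < n; ++i) total += *cursor++;
    // Without this reset the second pass would start at base + n and read
    // past the end of the buffer (undefined behaviour, often a page fault).
    cursor = base;  // the equivalent of "ramo = rami"
    // Second pass: the pre-existing code that assumed cursor == base.
    for (std::size_t i = 0; i < n; ++i) total += *cursor++;
    return total;
}
```

Giving each pass its own local pointer initialized from `base` removes the need to remember the reset at all.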
sanctus, posted October 25, 2007:
OK, eventually I figured out how to do it, also thanks to this. You start with:
$ qsub -l nodes=9 -l walltime=500:00:00 ./mcrun &
then you look up which machines the job runs on:
$ qstat -an
then you ssh to one of those machines, find the PIDs with top, and type
$ gdb mc_general pid
with pid being one of the PIDs found in top. Then, as before:
(gdb) b fastsplint()
Function "fastsplint()" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
The only remaining problem is that the program has already started by the time I get to where I want, so I have to add a sleep 1min to ./mcrun, and it should be fine. I hope.
P.S.: I also wrote this here so that I can come back tomorrow and see how I did it, but any comments are welcome.
alexander, posted October 26, 2007:
Drawing blanks here, sanctus; I've been up for 36 hours and coding for 24, tired as all hell... My code is almost working properly, I should say it's 99.8% working; I'll post it here in case people have similar projects... But seriously, this has been the most painful group project I have done in a while, because I had to write all of it, and there are four more people in my group... *shakes his head*
sanctus, posted October 26, 2007:
It doesn't work with sleep, because then the programs haven't started yet, and hence there is no PID... Does anyone have an idea?
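One common workaround is to put the pause inside the program rather than in ./mcrun: the process then already exists, and has a PID, while it waits for gdb to attach. A minimal sketch, assuming a POSIX system; the environment variable `WAIT_FOR_GDB`, the flag `keepWaiting` and `waitForDebugger` are all hypothetical names, and how the variable reaches the nodes (for example, exported inside ./mcrun before the program starts) is an assumption about that script:

```cpp
#include <cstdlib>
#include <iostream>
#include <unistd.h>  // getpid(), sleep(); POSIX only

// Hypothetical flag that the debugger clears after attaching.
volatile int keepWaiting = 1;

// Call this first thing in main(). If WAIT_FOR_GDB is set, the process
// prints its PID and sleeps in a loop until a debugger clears keepWaiting.
void waitForDebugger() {
    if (std::getenv("WAIT_FOR_GDB") == nullptr) {
        return;  // normal runs are not affected
    }
    std::cerr << "PID " << getpid() << " waiting for debugger\n";
    while (keepWaiting != 0) {
        sleep(1);
    }
}
```

After attaching with `gdb mc_general pid` and setting the breakpoint, `set var keepWaiting = 0` followed by `continue` lets the run proceed with gdb already in place. The `volatile` qualifier keeps the compiler from assuming the flag never changes inside the loop.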