Pyrotex | Posted April 17, 2009

Hey all you computer geeks out there! Achtung! I am faced with a really huge issue and I need all the help I can get from other folks who have software backgrounds and experience. I would like this thread to be where we can brainstorm and bring together any ideas or authoritative knowledge available.

The issue is this: What is "Software Reliability"? What is the difference, for example, between Software Reliability and Software Quality? How do you measure Software Reliability? How do you predict it? Can you attach engineering concepts such as Mean Time To Failure (MTTF) to software?

Let's assume "real-time" software, the kind of stuff that interacts with an essentially "random" and continuous input stream, where the code has to respond to that input immediately. For example, a flight simulator, or an operating system.

The problem is (as you code weenies will immediately see) that software (SW) doesn't "break"; it doesn't "wear out"; it doesn't "wear" at all. Not in the sense of hardware devices like valves and gears. An algorithm that (for example) calculates the sine of an angle will NEVER one day give a wrong answer after years of being correct. :thumbs_up ("my electrons got tired")

I'm NOT talking about buggy SW, or SW slopped together by an engineer who should know better. I'm talking about SW that has been designed, architected, written, tested and verified by professionals. It is as "perfect" as the professionals can make it.

Now... what is its "reliability"? How likely is it to "fail"? And what do I mean by "fail"? If that SW has to run correctly, continuously and unmaintained, for say 10,000 hours, what is the probability that it will do so? 99%? 99.99%? How do you calculate that?

Quotes from respectable sources are welcome. Brainstorming encouraged. Wild ideas and anecdotal stories welcome.
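For concreteness, here is a minimal sketch of what the hardware-style MTTF arithmetic looks like if you force it onto software: the standard constant-failure-rate (exponential) model, where mission reliability is a function of MTTF. The MTTF figure below is purely hypothetical; whether any such number is even meaningful for software is exactly the question of this thread.

```python
import math

def reliability(hours: float, mttf_hours: float) -> float:
    """Probability of running failure-free for `hours`, under the
    standard constant-failure-rate (exponential) hardware model."""
    return math.exp(-hours / mttf_hours)

# Hypothetical numbers: a 10,000-hour mission with an assumed MTTF of
# 1,000,000 hours gives roughly the "99%" figure asked about above.
print(reliability(10_000, 1_000_000))  # ~0.990
```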
phillip1882 | Posted April 17, 2009

To me, software reliability refers to how well software operates on bad input. If I give the sin function a string, will it crash my program, or just tell the user, "hey, you can't do that!"? The ability of software to handle problem input gracefully is what makes software reliable.

Quality software is software that does the required job well. If you need a program that plays pong, for example, quality software might have the paddles follow the mouse, as well as allow keyboard use. Reliable software ensures that if you move the mouse off the screen, or press the up/down key a little too long, the paddle won't move off the screen.
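A toy sketch of phillip's sin-function example in Python; the function name and the error message are just illustrative, not from any particular library:

```python
import math

def safe_sin(x):
    """Return sin(x) for numeric input; reject anything else gracefully."""
    if not isinstance(x, (int, float)):
        print(f"hey, you can't do that! sin() needs a number, not {type(x).__name__}")
        return None
    return math.sin(x)

print(safe_sin(0.5))      # 0.479... -- normal operation
print(safe_sin("hello"))  # polite complaint and None, instead of a crash
```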
Buffy | Posted April 17, 2009

Going beyond Phillip's point: it's not just the input to the software, it's also the errors caused by the operating system underneath. You can test your inputs all you want, but you can still end up with conditions where, for example, your stack or heap data becomes corrupted. Having software test its own already-captured data is almost non-existent in most programming practice, even though operating-system support for catching such errors is still weak, and this provides the vast majority of the holes that can be exploited by viruses, Trojans and other evil code.

I think "software quality" is just a warmer, fuzzier term that allows you to get away with ignoring the quantitative component of software reliability.... :)

Gain a modest reputation for being unreliable and you will never be asked to do a thing, :thumbs_up
Buffy
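One cheap way to do the kind of self-checking Buffy describes is to fingerprint data at capture time and verify it before trusting it again. A minimal sketch, assuming in-process Python data and illustrative names; a real system would checksum at a lower level:

```python
import hashlib
import pickle

def fingerprint(obj) -> str:
    """Checksum a data structure so later corruption can be detected."""
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

captured = {"gain": 1.25, "mode": "auto"}   # data validated at capture time
baseline = fingerprint(captured)

# ...much later, before trusting the captured data again...
if fingerprint(captured) != baseline:
    raise RuntimeError("captured data changed unexpectedly -- possible corruption")
```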
Pyrotex | Posted April 21, 2009

Trouble is, the word "reliability" has a well-defined meaning in engineering. Reliability is, literally, the probability that a component will perform its designed function for a specified length of time under specified constraints, environments or boundary conditions. For example, the reliability of a Mark VII-a 100-watt ventilation fan, in a Leopard battle tank, for 1,000 hours of continuous operation under desert warfare conditions, is [let's say] 96.2%.

Reliability is an empirically established probability of success. How did they get that number? They took 1,000 Mark VII-a 100-watt ventilation fans, put them through 1,000 hours of simulated conditions in a laboratory, and 962 of them were still running at the end of the test. 38 failed at some point during the test. That's a 3.8% failure rate, and a 96.2% reliability rating for 1,000 hours under defined conditions.

Or, you could have looked at the Leopard tanks in Desert Storm. Say there were 100 of them, and they were in battle ops for 100 hours each (on average), giving a total of 10,000 fan-hours. Given a reliability of 96.2%, one would expect about 0.4 fans to be replaced. Actually, 1 fan was replaced, well within 80% confidence bounds. This is how Injuneers calculate reliability.

How does one go about calculating the reliability of a software application? If you have run a million copies on a million PCs, then over time you can say with confidence that only 85 application crashes were reported. Do the math. But what if you have spent years building the application and it is to be executed under actual field conditions ONLY ONCE? Say the application's function is to manage a 450-kilogram spacecraft through de-orbit, re-entry and soft landing on Mars. How now do you calculate reliability, brown cow? ;)
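A sketch of the fan arithmetic above, assuming a constant failure rate so the 1,000-hour test result can be scaled to the Desert Storm fleet exposure:

```python
import math

# Empirical reliability from the lab test: 962 of 1,000 fans survived
survivors, units, test_hours = 962, 1000, 1000
reliability = survivors / units              # 0.962

# Assuming a constant failure rate, convert to failures per unit-hour...
lam = -math.log(reliability) / test_hours    # ~3.9e-5 per hour

# ...then scale to the fleet: 100 tanks x 100 hours = 10,000 fan-hours
expected_failures = lam * 100 * 100
print(f"{expected_failures:.1f}")            # 0.4, the figure quoted above
```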
Buffy | Posted April 22, 2009

Well, to start off, "software engineering" is an oxymoron. The two words should never appear together in the same paragraph. :)

Secondly, I think you make a big mistake by trying to bring in a hardware component with "run a million copies on a million PCs," because what you're measuring there is a *combination* of the software and the hardware. If you're a whiz with Bayesian probability, you can probably disambiguate the two (though probably only with twice the number of trials), but you're introducing an unknown additional uncertainty.

Now, the thing about software is that it's not a thing, it's a concept. As such, it's entirely deterministic; it does not wear out over time, get tired, or have a headache. Thus the only thing that will make it "fail" is input that it wasn't written to properly respond to. With a light bulb, you just turn it on and off and on and off, and it will eventually fail. With software, once it has properly handled a specific input stream, you know it always will, by definition. (This is also why you go horribly wrong if you "run on a million PCs": you're ONLY measuring the hardware reliability if you give the software an input that is known to work.)

So what you're left with is mainly just a combinatorial problem: you gotta generate all those input combinations and have enough time in eternity to test them all! :evil: Well, obviously, that could be a problem, eh? :rant:

Now, one thing you can do is break down the problem: unit testing on all the individual procedures will give you an idea of the reliability of the individual pieces, and while you can't fully test each piece, you can assign to each an amount of "potential unreliability" that is due to its inherent complexity--another way of saying "too many input combinations."

Software is rarely "hardened." If you claim that you don't make assumptions about your inputs, you're either lying, clueless, or work for NASA (c'mon, you *know* it's true!). But to the extent that you can show that various components are indeed hardened--that is, you can really test *all* their inputs--you can reduce the percentage of code that is subject to bugs.

Now, it's also true that the full combinatorial input problem is computationally prohibitive, but there's a funny thing about inputs: they can be categorized. Moreover, bugs almost always occur with inputs at "boundary values": you expect 0 to n and you get -1. Boom.

So when you think about it, "software reliability" is really a measure of those inputs that the program can't handle. You know you code properly for expected values, and most testing eliminates errors in the input ranges that are "expected." So what you're left with is the probability that you'll get an unexpected input *and* the software can't handle it. (You also need to recognize the dumb luck when the input is bad but the code just happens to be impervious to it. Happens to me all the time.) Thus, where you want to start is figuring out how to categorize the input so you can estimate the "holes" where the unexpected occurs because the input--unfortunately, in combination--is wrong. We'll leave that as an exercise for the reader....

1202, :)
Buffy
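A minimal sketch of the boundary-value idea Buffy describes, using a made-up `clamp` routine as the unit under test: probe at, just inside, and just outside each edge rather than enumerating every possible input.

```python
def clamp(value: int, lo: int, hi: int) -> int:
    """Keep value inside [lo, hi] -- the kind of routine that breaks at edges."""
    return max(lo, min(hi, value))

# Boundary-value cases for the range [0, 10]: each edge, plus one step
# inside and outside it. The -1 case is Buffy's "expect 0 to n, get -1."
cases = [(-1, 0), (0, 0), (1, 1), (9, 9), (10, 10), (11, 10)]
for value, expected in cases:
    assert clamp(value, 0, 10) == expected, (value, expected)
print("all boundary cases pass")
```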
Pyrotex | Posted April 22, 2009

Buffy wrote: "Well, to start off, 'software engineering' is an oxymoron. The two words should never appear together in the same paragraph. :P ... 1202, :)"

Well, that may make sense to YOU, but only because you have been writing software for just, like, decades. AMOF, so have I. But I am constrained by NASA policy to believe that "software engineering" is in fact what we do, and to behave accordingly. ;) Thence my sudden interest in determining what it means to speak of "software reliability" in quantitative terms.

1202? Aha! The year AD that Leonardo Fibonacci published Liber Abaci, introducing the Arabian zero to Europe, thereby giving computer science the other number it needed to get started, and (of course) resulting in the very first digital computer software bug! Which YOU, no doubt, discovered! :)

Thank you!
Buffy | Posted April 23, 2009

Pyrotex wrote: "But I am constrained by NASA policy to believe that 'software engineering' is in fact what we do, and to behave accordingly. :) Thence my sudden interest in determining what it means to speak of 'software reliability' in quantitative terms."

Thought about this a bit more. It ended up as a joke:

Q: What's the difference between a Programmer and a Software Engineer?
A: A Software Engineer is certain that 1 + 1 = 2 only 99.999% of the time.

There's got to be something different about the approach if any of this is going to go anywhere. These aren't light bulbs, or even a Saturn V.

Pyrotex wrote: "1202? Aha! The year AD that Leonardo Fibonacci published Liber Abaci, introducing the Arabian zero to Europe..."

Aw, so cute that you're pretending not to recognize the reference... you are pretending, right? :)

You can not operate in this room unless you believe that you are Superman, and whatever happens, you're capable of solving the problem, :P
Buffy
Pyrotex | Posted April 23, 2009

Buffy wrote: "Aw, so cute that you're pretending not to recognize the reference... you are pretending, right? :eek:"

Aw, so cute that you think I wouldn't know about Microsoft error #1202, failure to find a security signature file. But did you fail to realize that my "reference" to Fibonacci was historically correct? Is it just a coincidence that the error number and the date zero was introduced to Europe are the same? :eek: I thought that was rather clever, proving that I'm a programmer, not a true software engineer. I'm only pretending to be a SWE, for the paycheck, the fame, and the wild parties. :lol: