Frequently Asked Questions about the Full-scale Model

 

Question: Should I average the responses of several engineers?

 

Answer: If time does not allow the engineers to provide evidence for all answers, then averaging the responses is acceptable, provided you are certain that there are no global predispositions.  For example, sometimes ALL engineers may be overly optimistic or overly pessimistic.  In that case, averaging will not help much.  Note that organizations are just as likely to be pessimistic as optimistic.

 

Question: Who should answer the questions?

 

Answer: The questions on development practices such as analysis, design, code and unit testing should be answered by at least one software engineer.  Ideally, several people should answer the questions.  The results should be supported by the evidence shown in the tables below.  For example, if someone answers yes to a question, then they should be able to provide the evidence shown in the applicable tables below.

 

The system and regression testing questions should be answered by the software testers.  The project management and organization factors should be answered by a lead software engineer, software project manager or software manager.

 

Question: How were the points defined for the Full-scale and shortcut models? 

 

Answer: 

  1. Several organizations completed a comprehensive survey that includes all of the questions in the full-scale model, plus some additional questions.  SoftRel collected the actual fielded defects and the actual fielded size of the new code for the same projects that corresponded with the survey. This collection of data is called a "sample".   We computed the normalized defect density in terms of KSLOC of assembler for each sample in the database.  We also required the same evidence for each answer in the survey to reduce bias in the survey responses.

 

  2. All of the input variables and the normalized defect density response variable were used in several statistical models, including linear regression, logistic regression, non-linear regression, generalized additive models and neural networks.  Predictive models were generated from each type of statistical model and the results were compared.  Contrary to what is commonly done in industry, the R squared value was not the only criterion for determining the best model.   The R squared value shows how well the samples fit the predictive model; however, a common problem with samples that fit the model too well is "over-fitting".  Over-fitting is when the predictive model is able to model idiosyncrasies in a few samples that don't apply to the total population.

 

  3. The best predictive model was chosen.  In this case it was a non-linear model.  The "points" are simply parameters of the non-linear model that apply to each input variable. Contrary to how other entities have chosen to predict defect density, we don't use our "opinions" to determine the point system.  The points are the results of the predictive model that has the highest accuracy.  They are what they are. Keep in mind that the points are related to one and only one thing - defect density.  They are not related to schedule or cost or any other "ility", at least not directly.

 

  4. The predictive model also defined how the points are accumulated in order to arrive at the predicted defect density.  You can see that the formulas for the shortcut and full-scale models are non-linear.
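
The over-fitting concern described above can be illustrated with a minimal sketch. It is not SoftRel's model or data: the numbers are made up, and the two toy models (`fit_line` and `fit_memorizer`) are hypothetical stand-ins chosen only to show why the R squared value on the fitted samples should not be the only criterion. The sketch scores each model twice, once on the samples it was fitted to and once on samples it has not seen.

```python
# Illustrative only: made-up data and toy models, not the SoftRel model.

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least-squares straight line through the samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Returns the response of the nearest fitted sample (memorizes noise)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Hypothetical samples: x is a survey score, y is a defect density.
fitted_x = [1, 2, 3, 4, 5, 6]
fitted_y = [10, 7, 8, 5, 6, 3]            # downward trend plus noise
held_out_x = [1.4, 2.6, 3.4, 4.6, 5.4]    # samples not used for fitting
held_out_y = [8.6, 7.4, 6.6, 5.4, 4.6]

scores = {}
for name, fit in (("line", fit_line), ("memorizer", fit_memorizer)):
    model = fit(fitted_x, fitted_y)
    scores[name] = {
        "fitted": r_squared(fitted_y, [model(x) for x in fitted_x]),
        "held_out": r_squared(held_out_y, [model(x) for x in held_out_x]),
    }
    print(name, scores[name])
```

The memorizing model fits its own samples perfectly yet does worse than the simple line on the held-out samples, which is the symptom of over-fitting.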

 

Question:  I don't agree with some of the points.

 

Answer:  As indicated in step 3 above, the points are determined by the most accurate predictive model. We use fact as opposed to opinion to determine the point system.

 

Question:  How come the back end (systems testing) of the life cycle process has so many points? 

 

Answer:  As discussed in the previous questions, the points are determined based on fact and not opinion.  Putting that aside, visualize what would happen on a project if the entire system testing phase were skipped and the testing activities were not otherwise performed earlier in the lifecycle (such as with the clean-room methodology).  You probably visualized a real mess.  There were organizations in this database that chose to skip system testing and the result was exactly what you would expect.

 

Your intuition is on target, however.  The system testing parameters are what separate the very bad defect densities from the average defect densities, while the parameters at the front of the lifecycle determine the difference between the average defect densities and the very good defect densities.  So, in summary, think of the system testing parameters as "penalty" measures.  You won't get ahead of the average by doing them; you just avoid falling behind the average.

 

Question:  How come software redundancy is not a parameter?

 

Answer:  Redundancy is not "yet" a parameter because none of the samples in the database employed redundant software on a project.  Please remember that redundant software is NOT the same software on multiple hardware platforms.  Redundant software is software developed by more than one company to perform the same function; each version is supposedly unique because a different company developed it.

 

Question:  How come a manager that does not code has so many points?

 

Answer:  If your organization is very small (fewer than 4 total software engineers), a software manager might be able to code AND manage the other software engineers.  However, even on small software systems there are generally more than 4 software engineers.  If the manager is coding, then the manager has less time to manage, or is not managing the other software engineers at the level of detail needed.

 

Question:  How come code inspections has so few points?

 

Answer:  As discussed above, the point system is based on fact and not opinion.  Putting that aside, we believe that the code inspections had a low number of points because of any or all of the following:

 

  1. The organizations followed the procedures for doing inspections, but the criteria for the inspections were not effective (i.e., the inspections found no defects or found defects that were not important).
  2. The organizations followed the procedures for doing inspections, but the majority of the defects were of a type that could not easily be found by reviewing the code by itself.  For example, the defects were in the requirements or the design.
  3. In general, the practices at the beginning and end of the lifecycle had more points than the practices in the middle (coding).

 

The requirements and design reviews had more points than the code reviews, which supports arguments 2 and 3 above.

 

Question:  Do you plan to correlate specific brand name UML design tools?

 

Answer:  We correlate types of software tools to defect density.  However, we never correlate brand name software tools.  The reasons are:

 

  1. The features of those brand-name tools may/will change over time, making the correlation quickly obsolete.
  2. We do not intend to provide a forum for marketing any brand-name tool.
  3. The brand name of the tool is less important for prediction than the features that are automated by that tool - which is what we correlate.

 

Question:  How come programmer skill level is not modeled?

 

Answer:  This is a good question. 

 

  1. All parameters that are included in our study must be objectively measurable.  We have not found a way to accurately measure programmer skill level without some level of subjectivity.  Objective parameters such as college attended, degree earned and grade-point average were included at one time, but they did not correlate, so they were dropped.
  2. Most software products are no longer developed by individuals working alone; they are developed by groups or teams of individuals.  What correlates in our study is how those individuals function as a team.

 

Question:  How come the points aren't linearly related to the correlations?

 

Answer:

 

The points measure both relationship and impact.  Correlation measures only relationship between a practice and defect density but does not measure impact.  For example, practice x may have a very high correlation to lower defect density, but the magnitude of that lower defect density may be small.  On the other hand, there may be a practice with a weaker correlation but higher impact.  This practice doesn't always produce lower defects, but when it does, the difference is very measurable.  Ideally, it would be nice to have the practices in place that have both high impact and high correlation so as to minimize the risk and maximize the returns of implementing that practice.
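
The distinction between relationship and impact can be shown with a small sketch. Everything in it is an assumption made for illustration: the defect densities are invented, "impact" is taken to mean the average reduction in defect density when the practice is used, and `pearson` and `mean_reduction` are hypothetical helper names, not part of the SoftRel model.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))

def mean_reduction(used, density):
    """Average defect density without the practice minus average with it."""
    with_it = [d for u, d in zip(used, density) if u]
    without = [d for u, d in zip(used, density) if not u]
    return sum(without) / len(without) - sum(with_it) / len(with_it)

# 0 = project did not use the practice, 1 = project used it (made-up data).
used = [0, 0, 0, 0, 1, 1, 1, 1]

# Practice x: always helps, but only a little.
density_x = [1.00, 1.01, 0.99, 1.00, 0.90, 0.91, 0.89, 0.90]
# Practice y: helps a lot on some projects and not at all on others.
density_y = [0.6, 1.4, 0.8, 1.2, 0.2, 1.3, 0.1, 1.2]

corr_x, impact_x = pearson(used, density_x), mean_reduction(used, density_x)
corr_y, impact_y = pearson(used, density_y), mean_reduction(used, density_y)

print(f"practice x: correlation {corr_x:.2f}, average reduction {impact_x:.2f}")
print(f"practice y: correlation {corr_y:.2f}, average reduction {impact_y:.2f}")
```

Under these assumptions, practice x has the stronger correlation with lower defect density while practice y has the larger average reduction, so ranking practices by correlation alone and ranking them by impact alone give different orders.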

 

Question: It seems to me that all I have to do is find the items with the highest points that have a no answer and implement them?

 

Answer:  Not necessarily.

 

The items with the highest points may also be expensive, may take calendar time to implement and may have prerequisites that cannot be resolved easily by your organization.  Sometimes implementing a few items with a moderate number of points is the fastest and cheapest way. Frestimate now has a cost model that lets you see the expense, difficulty and prerequisites involved in making improvements.
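
As a rough sketch of this trade-off, the snippet below ranks candidate improvements by points per unit of cost rather than by points alone. It does not reproduce the Frestimate cost model, whose details are not described here: the candidate names, points, costs, budget and the `pick_improvements` helper are all hypothetical, and the greedy selection is only a simple heuristic.

```python
# Illustrative only: hypothetical candidates, not the Frestimate cost model.

def pick_improvements(candidates, budget):
    """Greedy heuristic: best points-per-cost first, within the budget."""
    feasible = [c for c in candidates if c["prerequisites_met"]]
    feasible.sort(key=lambda c: c["points"] / c["cost"], reverse=True)
    chosen, remaining = [], budget
    for c in feasible:
        if c["cost"] <= remaining:
            chosen.append(c)
            remaining -= c["cost"]
    return chosen

candidates = [
    {"name": "practice A", "points": 10, "cost": 100, "prerequisites_met": True},
    {"name": "practice B", "points": 4, "cost": 10, "prerequisites_met": True},
    {"name": "practice C", "points": 3, "cost": 10, "prerequisites_met": True},
    {"name": "practice D", "points": 8, "cost": 20, "prerequisites_met": False},
]

chosen = pick_improvements(candidates, budget=25)
chosen_names = [c["name"] for c in chosen]
total_points = sum(c["points"] for c in chosen)
print(chosen_names, total_points)
```

With these made-up numbers, the highest-point item (practice A) does not fit the budget and practice D is blocked by an unmet prerequisite, so two moderate items are selected instead.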

 

Question: Does SoftRel have plans to add more application types?

 

Answer: Yes, we do this continually.  If you send us a completed Frestimate prediction with actual responses for the SoftRel Full-scale model and include the actual observed defects, the size estimates and the application type, we will include that data in the model.  You can also forward a non-disclosure agreement for signature prior to sending data.  We keep the sources of all data confidential.

 

Question: How come some industries have higher defect densities than others?

 

Answer: The answer really boils down to how many hours the software must operate in the field without interruption or service.  The longer this requirement, the smaller the average defect densities are likely to be.  The criteria that determine how long the software must operate without interruption may include:

1. How easy is the software (or its platform) to service?  For example, the serviceability of the software in a dishwasher is different from that of the software in a satellite.  The dishwasher would require either a maintenance call or a phone line/remote communications to service the software, while the satellite requires remote communications for uploading/downloading.

2. How many units containing the software must be supported? There are many more dishwashers than satellites.  One maintenance call for a dishwasher multiplied by the total number of dishwashers might result in a maintenance nightmare if the fielded defect density is not low enough.

3. How visible is the impact of a software defect?  You can see that for the scientific software, the defect density was very small.  This is because, for this type of application, the outputs simply cannot be incorrect, as just one incorrect output would result in a user no longer having confidence in the software as a whole.