Sunday, March 17, 2013

Manage Code Quality Using Sonar

Target audience: Beginner
Estimated reading time: 4'




Introduction

It is fair to say that setting up and maintaining a source code quality management process is not at the top of the priority list of developers and managers alike. Despite all the good intentions, cost and/or time to market relegate tasks such as setting up a code analyzer and enforcing coding best practices to the back burner. Unfortunately, those tools have earned a reputation for being difficult to maintain and costly to license. This is no longer the case.
  
Sonar is an open-source platform used by development teams to manage source code quality. The main purpose of the project is to simplify code quality management.
Sonar supports analysis of Java out of the box, as well as up to 10 other programming languages through plug-ins. The automation associated with code quality management can be broken down into two distinct objectives:
  • Continuous integration: Build automation, static code analysis & code coverage reports
  • Continuous reviews: Best practices violations, code improvement and refactoring, stability and maintainability
Cyclomatic complexity, originally developed by Thomas McCabe, directly measures the number of linearly independent paths through a program's source code.
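
As a quick, hypothetical illustration: counting one for the method's entry point plus one for each decision point (the for loop and the if statement), the method below has a cyclomatic complexity of 3.

public class ComplexityExample {
  // Cyclomatic complexity = 3: entry point (1) + for loop (1) + if statement (1)
  public static int countPositives(int[] values) {
    int count = 0;
    for (int v : values) {
      if (v > 0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(countPositives(new int[] {1, -2, 3}));  // prints 2
  }
}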

Readability of source code is quite often evaluated using the Flesch-Kincaid test, which was originally developed to measure the readability of academic English. The method scores the "complexity" of any document from 100 (easily understood by an average 11-year-old student) down to 0 (best understood by domain experts and scholars), with articles in "Time Magazine" scoring around 50.
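
For reference, the Flesch reading-ease score combines the average sentence length and the average number of syllables per word: \[206.835 \,-\, 1.015\left(\frac{total\ words}{total\ sentences}\right) \,-\, 84.6\left(\frac{total\ syllables}{total\ words}\right)\]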


Dashboard

The purpose of the dashboard is to provide an overview of the static code analysis. The following report shows the analysis of the Java code base of Jena, an open-source, Apache-licensed library used to create and manage semantic databases and RDF triples. A DevOps engineer or a manager uses Sonar to answer some basic questions before drilling down into specific problem areas.

  • Duplication: What is the percentage of redundant code that needs to be refactored or eliminated? Is the redundant code caused by poor design or by legacy code?
  • Code coverage: What percentage of execution paths is exercised through unit tests? Is poor coverage associated with violations of coding best practices?
  • Rules compliance: What are the most common violations of, or deviations from, standards?
  • Code complexity: How difficult is it to maintain, modify and extend the current code base?
The dashboard can be easily customized using custom filters and layouts.


Best practices violation

One "side" benefit of any static code analyzer is that it forces the engineering team to define and maintain a set of best practices. Sonar uses a severity ranking, similar to most common defect databases, to classify violations of coding standards. The following table displays the severity and type of violations of best-practice rules.


By default, the current version of Sonar, 3.0, contains 600 off-the-shelf coding rules. The user can create custom rules or override existing ones using XPath expressions.
Inner Assignment
checkstyle : com.puppycrawl.tools.checkstyle.checks.coding.InnerAssignmentCheck
Checks for assignments in subexpressions, such as in String s = Integer.toString(i = 2);
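
As a small, hypothetical illustration of the violation and a compliant rewrite:

public class InnerAssignmentExample {
  public static void main(String[] args) {
    int i = 0;

    // Violation: the assignment i = 2 is buried inside a subexpression
    String flagged = Integer.toString(i = 2);

    // Compliant rewrite: the assignment is a statement of its own
    i = 3;
    String compliant = Integer.toString(i);

    System.out.println(flagged + " " + compliant);  // prints "2 3"
  }
}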

Avoid Throwing Raw Exception Types

Avoid throwing certain exception types. Rather than throw a raw RuntimeException, Throwable, Exception, or Error, use a subclassed exception or error instead.
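
A small, hypothetical example of the violation and one possible fix using an application-specific exception class:

// Hypothetical application-specific exception, thrown instead of a raw RuntimeException
class ConfigurationException extends RuntimeException {
  public ConfigurationException(String msg) { super(msg); }
}

public class ExceptionRuleExample {
  // Violation: throws a raw RuntimeException
  static void loadBad(String path) {
    if (path == null) throw new RuntimeException("Missing configuration path");
  }

  // Compliant: throws a dedicated subclass that callers can handle specifically
  static void loadGood(String path) {
    if (path == null) throw new ConfigurationException("Missing configuration path");
  }

  public static void main(String[] args) {
    loadGood("sonar.properties");
  }
}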

Sonar allows developers to take a snapshot of the overall quality of an application and to view the evolution of quality measures across snapshots with the TimeMachine service. But this was not sufficient to provide quick answers to the fundamental question: what changed over the past X days?
Recently, Sonar added a differential dashboard that allows developers to visualize the difference between two quality snapshots. Those quality snapshots can be assigned a version and purged according to a configurable policy.

Getting started

Sonar is made of 3 components:
  • Database that stores the configuration and the results of quality analyses  
  • Web server that is used to navigate the results of the analyses and to configure Sonar
  • Client that runs source code analyzers to compute data on projects
    Examples of static code analysis using Sonar are available on GitHub.

    Installation of Sonar

    Here are the steps to install Sonar:
       1. Download the latest version of Sonar from Sonar Downloads
       2. Unzip the installation package
       3. In the directory sonar-xx/conf open and edit the Sonar property file
       4. Override the credentials

    sonar.jdbc.username: sonar  
    sonar.jdbc.password: sonar
       5. By default, Sonar is bundled with the Apache Derby database
             sonar.jdbc.url: jdbc:derby://localhost:1527/sonar;create=true
             sonar.jdbc.driverClassName: org.apache.derby.jdbc.ClientDriver

       6. If you want to use your own database, you need to create the database and 
           relevant tables
       7. Then specify the JDBC URL and driver class name for your database

    sonar.jdbc.url: jdbc:oracle:thin:@localhost:1521/instance-name  
    sonar.jdbc.driverClassName: oracle.jdbc.driver.OracleDriver 
    or
    sonar.jdbc.url: jdbc:mysql://localhost:3306/sonar?useUnicode=true&characterEncoding=utf8
    sonar.jdbc.driverClassName: com.mysql.jdbc.Driver

    Installation of the Eclipse plug-in

    Assuming Eclipse version 3.7 or later is installed...
    1. Go to Help > Install New Software... This should display the Install dialog box. 
    2. Paste the Update Site URL (http://dist.sonar-ide.codehaus.org/eclipse/) into the field Work with and press Enter. This should display the list of available plugins and components: 
    3. Check the component you wish to install. 
    4. Click Next. Eclipse will then check whether there is any issue that would prevent a successful installation.
    5. Click Finish to begin the installation process. Eclipse will then download and install the necessary components. 
    6. Once the installation process is finished, Eclipse will ask if you want to restart the IDE. It is strongly recommended that you restart the IDE.

    Integration with Jenkins

    Assuming that the Jenkins continuous integration server is already installed, you need to:
    1. log into the Jenkins installation's management screen
    2. click on the Manage Plugins menu
    3. select the Available tab. If the list of plug-ins is empty, select the Advanced tab and force a check for new updates by clicking Check now.
    4. select the check box corresponding to the Sonar plug-in for a particular language
    5. select the Install without restart option. The installation will then complete.


    Tuesday, February 19, 2013

    Scrum & Distributed Teams

    Target audience: Beginner
    Estimated reading time: 3'


    Overview

    The power of Scrum lies in instant decisions and collaborative work within the team and with the product owner. Most books and articles on the subject assume, indirectly, that the entire team sits in the same building or vicinity. Unfortunately, an increasing number of software companies either have engineering teams spread around the world, or allow telecommuting, or both. Such organizations create challenges for management and more specifically for the Scrum master.

    Challenges

    • Different cultures or ethics. Some cultures have different rules about accessibility outside normal business hours and about privacy. Most of western Europe has very strict labor regulations that may have an impact on the quality of the collaboration between teams.
    • Language barrier. Besides the obvious challenge of accommodating different regional accents, people are sometimes hesitant to ask the speaker to repeat himself or herself, leaving the listener guessing and making inaccurate assumptions. 
    • Confusing communication. Companies with distributed teams offer multiple channels of communication in order to improve collaboration. However, this strategy is not without risk, as the same message, report or request may differ between channels of communication.  
    • Difficulty building the team. Considering that Scrum principles focus on close collaboration, it is quite a challenge to build team dynamics and motivation. 
    • Overly formal interaction. Team spirit and commitment are built through informal interaction. Such a tactic is certainly more difficult to implement within a distributed team than with a group of engineers sharing the same office.

    Tools

    The last few years have seen a significant increase in the number of options available to facilitate collaboration across continents. 

    1. Video-conferencing: This approach is best suited for weekly status meetings, sprint planning and retrospectives. Skype is a low-cost solution to create a bond between team members and to reduce noise in communication. Managers can observe the mood of the teams through non-verbal communication.

    2. Instant Messaging
    I have found that instant messaging is perfectly suited for quick one-on-one exchanges between engineers. It is foremost a very effective tool because it allows informal, abbreviated, short messages with links to documents, source code, test cases or meeting minutes to be reviewed by a peer. Some IM solutions such as Meebo include extra features that make archiving and searching through messages easier.

    3. Sharing Documents
    Although Scrum is not conducive to a large amount of documentation, there is always a need for functional specifications, design documents or test plans to be drafted and shared. DropBox or Google Drive provide a simple and effective platform for engineers to share and update documents, especially when used in conjunction with a micro-blog. Once a document has been updated, reviewed and approved, it can be converted into a wiki page.

    4. Micro blogs
    As long as they are used judiciously, micro-blogs such as Twitter are a great way to notify a group of engineers or the entire team of changes in procedures, summaries of the daily sprint stand-up, status updates, or announcements of meetings and corporate events. However, management needs to monitor the content, tone and frequency of those posts. Employees become insensitive and unresponsive when they are constantly bombarded with large numbers of messages.

    5. Wiki
     Wikis are still the best medium for well-defined and complete documents such as coding standards, the list of user stories for an upcoming sprint, or the minutes of a retrospective Scrum meeting.

    6. Email
    From my personal experience, email is an overused and abused communication medium. It should be reserved for formal and/or confidential information.

    7. Meetings
    Because of time zone differences, we cannot expect everyone from every part of the world to attend every meeting. Meetings should be run effectively, with a clearly defined agenda, a time limit and detailed minutes, so that engineers do not feel compelled to drop more critical tasks to attend a meeting for fear of missing important information. Meetings should be restricted to evaluating proposals, making recommendations, bringing different points of view, and eventually making decisions.

    Management

    1. Managing communication
    As I mentioned earlier, some of the communication tools can be easily abused or misused. It is the responsibility of the manager to monitor any team-wide or company-wide communication, both for the sake of legal and ethical implications and for the overall productivity of the engineering group.

    2. Managing collaboration
    Some tasks, such as pair programming, require two or more engineers to collaborate on a specific problem. I believe managers are responsible for organizing schedules so that the business hours of the different teams overlap, in order to facilitate technical collaboration and brainstorming.




    Monday, January 7, 2013

    Naive Bayes Classifier in Java

    Target audience: Advanced
    Estimated reading time: 5'




    Introduction

    The Naive Bayes approach is a generative, supervised learning method based on a simplistic hypothesis: it assumes that the presence of a specific feature of a class is unrelated to the presence of any other feature. This condition of independence between model features is what keeps the classification simple and tractable.
    Mathematically, Bayes' theorem gives the relationship between the probabilities of A and B, p(A) and p(B), and the conditional probabilities of A given B and B given A, p(A|B) and p(B|A).
    In its most common form the Naive Bayes Formula is defined for a proposition (or class) A and evidence (or observation) B with \[p(A|B)= \frac{p(B|A).p(A)}{p(B)}\]
       - p(A), the prior, is the initial degree of belief in A.
       - p(A|B), the posterior, is the degree of belief having accounted for B.
       - p(B|A)/p(B) represents the support B provides for A.
    The case above can be extended to a network of cause-effect conditional probabilities p(X|Y).
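
    As a quick numerical illustration (the numbers are arbitrary): if a defect occurs with prior probability p(A) = 0.01, a test flags defective units with p(B|A) = 0.9, and the test flags p(B) = 0.05 of all units, then \[p(A|B)=\frac{0.9 \times 0.01}{0.05}=0.18\] so observing a flag raises the degree of belief in a defect from 1% to 18%.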


    In the case where the features of the model are known to be independent, the probability that an observation x = (x1, ..., xn) belongs to a class C is computed as \[p(C|\vec{x})=\frac{\prod_{i} p(x_{i}|C)\,p(C)}{p(\vec{x})}\] It is usually more convenient to compute the maximum likelihood of the probability that a new observation belongs to a specific class by taking the logarithm of the formula above: \[log\,p(C|\vec{x}) = \sum_{i} log\,p(x_{i}|C) + log\,p(C) - log\,p(\vec{x})\]
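
    A minimal sketch (not part of the classifier below) of why the logarithmic form is preferred numerically: multiplying many small conditional probabilities underflows to zero in double precision, while summing their logarithms remains well-behaved.

    public final class LogLikelihoodDemo {
      public static void main(String[] args) {
        double product = 1.0;
        double logSum = 0.0;
        // 1,000 features, each with a small conditional probability p(x_i|C) = 0.001
        for (int i = 0; i < 1000; i++) {
          product *= 0.001;            // underflows to 0.0
          logSum += Math.log(0.001);   // stays finite: about -6907.76
        }
        System.out.println("product = " + product + ", sum of logs = " + logSum);
      }
    }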

    Note: For the sake of readability of the implementation of algorithms, all non-essential code such as error checking, comments, exception, validation of class and method arguments, scoping qualifiers or import is omitted.

    Software design

    The class in the example below implements a basic version of the Naive Bayes algorithm. The model and its features are defined by the nested class NClass. This model class holds the feature parameters (mean and variance of the prior observations) and the class probability p(C). The computation of the mean and variance of the priors is implemented in the NClass.computeStats method. Some of the methods, setters, getters, comments and conditional tests on arguments are omitted for the sake of clarity. The kernel function is selected at run-time. This implementation supports any number of features and classes.

    public final class NaiveBayes implements Classifier {
      
      public final class NClass {
        private double[] _params         = null;
        private double[] _paramsVariance = null;
        private double   _classProb = 0.0;
     
        public NClass(int numParams) { 
          _params = new double[numParams];  
        }
     
          // Accumulates the sum and the sum of squares of the observations
          // assigned to this class. Each record in the flattened data array
          // is assumed to start with one entry (a label) that is skipped.
        private void add(double[] data) {
          int numObservations = 0;
                 
          _paramsVariance = new double[_params.length];
          for(int j = 0; j < data.length; ) {
            j++;
            for( int k = 0; k < _params.length; k++, j++) {
              _params[k] += data[j];
              _paramsVariance[k] += data[j]*data[j];
            }
            numObservations++;
          }
          _classProb = numObservations;
        }
     
          // Converts the accumulated sums into mean and variance, and the
          // number of observations into the class probability p(C).
        private void computeStats() {
          double inv = 1.0/_classProb;
     
          for( int k = 0; k < _params.length; k++) {
            _params[k] *= inv;                      // mean of the feature
            _paramsVariance[k] = _paramsVariance[k]*inv -
                       _params[k]*_params[k];       // E[x^2] - mean^2
          }
          _classProb /= _numObservations;
        }
      }
    }
    

    Kernel functions can be used to improve the classification of observations by increasing the distance between the priors belonging to different classes during the training phase. In the case of 2 classes (Bernoulli classification) C1 and C2, the kernel algorithm increases the distance between the mean values m1 and m2 of all the prior observations for each of the two classes, adjusted for the variance.


    As Java does not support local functions or closures, we need to create a class hierarchy to implement the different kernel (discriminant) functions. The example below defines simple linear and logistic (sigmoid) kernel functions implemented as nested classes. \[y = \theta x \,\,and\,\,y =\frac{1}{1+e^{-x}}\]

    public interface Discriminant {
       public double estimate(double value);
    }
           
        //Nested class that implements a linear discriminant 
    public static class DiscriminantKernel 
                  implements Discriminant  {
       private double _theta = 1.0;
       public DiscriminantKernel(double theta) { 
         _theta = theta; 
       }  
       public double estimate(double value) { 
         return value*_theta; 
       }
    }
                 
           // Nested class that implements a sigmoid (logistic) function for the kernel
    public static class SigmoidKernel implements Discriminant {
      public double estimate(double value) { 
        return 1.0/(1.0 + Math.exp(-value)); 
      }
    }
    

    Ultimately, the NaiveBayes class implements the three key components of the learning algorithm:
    • Training: train
    • Run-time classification: classify
    • Scoring: a new observation is classified using the logarithmic version of the Naive Bayes formula, logP
    First let's define the NaiveBayes class and its constructors.


    public final class NaiveBayes implements Classifier {
    
       public final class NClass { }
    
       private CDoubleArray[] _valuesArray = null;
       private NClass[] _classes = null;
       private int _numObservations = 0;
       private int _step = 1;
       private Discriminant _kF = null;
          
       public NaiveBayes() { this(0,0); }
       public NaiveBayes(int numParams, int numClasses) { 
         // Defaults to the linear discriminant (DiscriminantKernel) defined above
         this(numParams, numClasses, new DiscriminantKernel(1.0));
       }
                
       public NaiveBayes(
          int numParams, 
          int numClasses, 
          final Discriminant kf
       ) {
         _classes = new NClass[numClasses];
         _valuesArray = new CDoubleArray[numClasses];
     
         for( int k = 0; k < numClasses; k++) {
           _classes[k] = new NClass(numParams);
           _valuesArray[k] = new CDoubleArray();
         }
         _kF = kf;
         this.discretize(0,numClasses);
      }
       ..
    } 
    

    Training

    Next, the training method train is defined. The method consists merely of computing the statistics on the historical data _valuesArray and assigning them to the predefined classes _classes.

    public int train() throws ClassifierException {
      double[] values =  null;
               
      for( int j = 0; j < _valuesArray.length; j++) {
        values = _valuesArray[j].currentValues();
        _classes[j].add(values);
      }
               
      for( int j = 0; j < _classes.length; j++) {
        _classes[j].computeStats();
      }
      return values.length;
    }
    

    Classification

    The run-time classification method classify uses the prior conditional probability to assign a new observation to an existing class. It generates the class id for a set of values (observations).

    public int classify(double[] values) {
               
       // Compute the normalizing denominator value
      double[] normalizedPriorProb = new double[values.length];
      double   prob = 0.0;
    
      for( int valueIndex = 0; valueIndex < values.length; valueIndex++) {
    
        for(int classid = 0; classid < _classes.length; classid++) {
          prob = Math.abs(values[valueIndex] - 
            _classes[classid]._params[valueIndex]);
          if( prob > normalizedPriorProb[valueIndex]){               
              normalizedPriorProb[valueIndex] = prob;
          }
        }
      }
      return logP(values, normalizedPriorProb);
    }
    

    A new observation values is assigned to the appropriate class according to its likelihood, or log of the conditional probability, by the method logP.
    logP computes the likelihood for each value and uses the Naive Bayes formula for the logarithm of the prior probability and the log of the class probability.


    private int logP(double[] values, double[] denominator) {
      double score = 0.0, 
             adjustedValue = 0.0, 
             prior = 0.0,
             bestScore = -Double.MAX_VALUE;
      int bestMatchedClass = -1;
                    
      // Walk through all the classes defined in the model
      for(int classid = 0; classid < _classes.length; classid++) {
        double[] classParameters = _classes[classid]._params;
                         
        score = 0.0;
        for( int k = 0; k < values.length; k++) {
           adjustedValue = _kF.estimate(values[k]);
           prior = Math.abs(adjustedValue - classParameters[k])/
                   denominator[k];
           score += Math.log(1.0 - prior);
        }
        score += Math.log(_classes[classid]._classProb);
                        
        if(score > bestScore) {
            bestScore = score;
            bestMatchedClass = classid;
        }
      }
      return bestMatchedClass;
    }
    

    Some of the ancillary private methods are omitted for the sake of clarity. We will look at the implementation of the same classifier in Scala in a later post.
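
    A toy, self-contained illustration of the scoring logic used in logP, with two classes, one feature and arbitrary numbers (it does not depend on the omitted ancillary classes such as CDoubleArray):

    public final class NaiveBayesScoringDemo {
      public static void main(String[] args) {
        double[] classMeans = { 2.0, 8.0 };   // mean of the single feature for class 0 and class 1
        double[] classProbs = { 0.7, 0.3 };   // class probabilities p(C) estimated from training
        double observation  = 7.0;
        double denominator  = 6.0;            // arbitrary normalizing constant

        int bestClass = -1;
        double bestScore = -Double.MAX_VALUE;
        for (int c = 0; c < classMeans.length; c++) {
          double prior = Math.abs(observation - classMeans[c]) / denominator;
          double score = Math.log(1.0 - prior) + Math.log(classProbs[c]);
          if (score > bestScore) { bestScore = score; bestClass = c; }
        }
        System.out.println("Best matched class: " + bestClass);   // prints 1
      }
    }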

