Showing posts with label Discounted Cumulative Gain. Show all posts
Showing posts with label Discounted Cumulative Gain. Show all posts

Friday, April 8, 2022

Discounted Cumulative Gain in Scala

Target audience: Advanced
Estimated reading time: 5'

Numerous real-life applications of machine learning require the prediction the most relevant ranking of items to optimize an outcome. This article illustrates the Normalized Discounted Cumulative Gain (NDCG) and its implementation in Scala.


Table of contents
Follow me on LinkedIn

What you will learn:  How to implement the Discounted Cumulative Gain to assess the effectiveness of a search algorithm or recommendation system


Notes:
  • Environments: Scala 2.12.11
  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Discounted Cumulative Gain

Discounted Cumulative Gain (DCG) and its normalized counter part, Normalized Discounted Cumulative Gain (NDCG)  is a measure used in information retrieval and ranking system evaluation to assess the effectiveness of a search algorithm or recommendation system. DCG is particularly useful for situations where the items are ranked in order of relevance or importance and the value of each item diminishes as one goes down the list of search results.

Here are examples of real-life applications that benefit from NDCG:
  • Evaluate and prioritize counter-measures to cyberattack.
  • Ranks symptoms in a clinical trial.
  • Extract documents relevant to a given topic from a corpus.
Let's dive into the mathematical formalism for the Discounted Cumulative Gain. 


For an indexed relevance tj as illustrated in the diagram above, the discounted cumulative gain at position n is computed as \[DGC_{j}=\frac{2^{t_{j}}-1}{log_{2}(j+1)} \ (1)\ \ \ \ DGC=\sum_{j=1}^{n} DGC_{j}\ (2) \ \ \]
The objective is to compare any given list of ranked/sorted item with a benchmark which represent the optimum ranking (ranking with the highest DCG value).  \[NDGC=p(ranking | IDGC) = log(2) \frac{DGC}{IDGC}\ \ (3) \] Each item in the ranked list is assigned a relevance score, typically a positive real number, with higher scores indicating more relevant items. The relevance score of each item is discounted logarithmically based on its position in the list. This discounting reflects the notion that users are less likely to consider items as they move down the list, so items appearing earlier have more impact.


Scala implementation

Computation

The implementation of the computation of NDCG in Scala is quite simple, indeed. Given a ranked list of items. The three steps are

  1. Compute the IDCG (or normalization factor)from the list
  2. Compute the DCG for the list
  3. Compute the NDCG = log(2).DCG/IDCF
First let's consider list of items, of type T to rank. The method ranking to sort a sample of sorted items is provided as an implicit function. The constructor for NDCG has a single argument: the sample of ranking: 

class NDCG[T](initialSample: Seq[T])(implicit ranking: T => Int) {
   import NDCG._

   val iDCG: Double = normalize

   def ndgc: Double = score(initialSample)

    // Implement Formula (3)
   def ndgc(samples: Seq[T]): Double =
       if( samples.size != initialSample.size) 0.0 
       else Math.log(2)*dcg(samples)/iDCG

   private def normalize: Double = {                                // 1
      val sorted = initialSample.zipWithIndex.sortBy{       // 2
         case (sample, n) =>  -ranking(sample)
       }.map( _._1)

       dcg(sorted)
   }

    // Implement Formula (2)
   private def dcg(samples: Seq[T]): Double =                // 3
      samples.zipWithIndex.aggregate(0.0)(                     // 4
        (_dcg, samplej) => _dcg+ dgci(samplej._2+1, ranking(samplej._1))  // 5
          , _ + _)
}


The Ideal Discounted Cumulative Gain, iDCG is computed through the normalize method (# 1). iDCG (normalization factor) is computed by first sorting the items of type T by their value in decreasing order (# 2), then scoring this re-sorted list using the dcg method (# 3). 
The computation of the Discounted Cumulative Gain by the method dici (# 5) is a direct application of the formula (1) described in the previous chapter. The  cumulative gain in formula (2) is computed using the Scala aggregate method (# 4).

As mentioned earlier, the method dgci implements the formula (1). The logarithm function uses a base 2. It is computed as natural log(x)/natural log.


    // Implement Formula (1)
def dgci(item: T, rank: Int): Double =     Math.pow(2, item) - 1.0) / Math.log(rank + 1)


Note The DCG of subsequent samples can be computed using the same iDCG value from the same instance of NDCG.

Evaluation

Let's now consider a list of items of type Item defined as follows: 

case class Item(id: String, x: Double, rank: Int)


The list of items, itemsList is implicitly ranked through the attribute, rank using a Scala implicit

val itemsList = Seq[Item](
  Item("1", 3.4, 4), 
  Item("2", 1.4, 3),
  Item("3", -0.7, 5), 
  Item("4", 5.2, 2), 
  Item("5", 1.4, 1)
)

implicit val ranking = (item: Item) => item.rank

Finally, let's compute the NDCG coefficient for the list of items, by invoking the score method.

val nDCG = new NDCG[Item](itemsList)

println(s"IDCG = ${nDCG.iDCG}")    //45.64
println(s"Score = ${nDCG.score}")  // 0.801

The ideal discounted cumulative gain, iDCG is 45.6: It is the optimum ranking for this list of time. The first sample score a probability of 0.8 


Thank you for reading this article. For more information ...

References


---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. 
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3