Target audience: Advanced
Estimated reading time: 5'
Numerous real-life applications of machine learning require the prediction the most relevant ranking of items to optimize an outcome. This article illustrates the Normalized Discounted Cumulative Gain (NDCG) and its implementation in Scala.
What you will learn: How to implement the Discounted Cumulative Gain to assess the effectiveness of a search algorithm or recommendation system
Notes:
- Environments: Scala 2.12.11
- To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.
Discounted Cumulative Gain
Discounted Cumulative Gain (DCG) and its normalized counter part, Normalized Discounted Cumulative Gain (NDCG) is a measure used in information retrieval and ranking system evaluation to assess the effectiveness of a search algorithm or recommendation system. DCG is particularly useful for situations where the items are ranked in order of relevance or importance and the value of each item diminishes as one goes down the list of search results.
Here are examples of real-life applications that benefit from NDCG:
- Evaluate and prioritize counter-measures to cyberattack.
- Ranks symptoms in a clinical trial.
- Extract documents relevant to a given topic from a corpus.
For an indexed relevance tj as illustrated in the diagram above, the discounted cumulative gain at position n is computed as \[DGC_{j}=\frac{2^{t_{j}}-1}{log_{2}(j+1)} \ (1)\ \ \ \ DGC=\sum_{j=1}^{n} DGC_{j}\ (2) \ \ \]
The objective is to compare any given list of ranked/sorted item with a benchmark which represent the optimum ranking (ranking with the highest DCG value). \[NDGC=p(ranking | IDGC) = log(2) \frac{DGC}{IDGC}\ \ (3) \] Each item in the ranked list is assigned a relevance score, typically a positive real number, with higher scores indicating more relevant items. The relevance score of each item is discounted logarithmically based on its position in the list. This discounting reflects the notion that users are less likely to consider items as they move down the list, so items appearing earlier have more impact.
Scala implementation
Computation
The implementation of the computation of NDCG in Scala is quite simple, indeed. Given a ranked list of items. The three steps are- Compute the IDCG (or normalization factor)from the list
- Compute the DCG for the list
- Compute the NDCG = log(2).DCG/IDCF
class NDCG[T](initialSample: Seq[T])(implicit ranking: T => Int) {
import NDCG._
val iDCG: Double = normalize
def ndgc: Double = score(initialSample)
// Implement Formula (3)
def ndgc(samples: Seq[T]): Double = if( samples.size != initialSample.size) 0.0 else Math.log(2)*dcg(samples)/iDCG
private def normalize: Double = { // 1
val sorted = initialSample.zipWithIndex.sortBy{ // 2
case (sample, n) => -ranking(sample)
}.map( _._1)
dcg(sorted)
}
// Implement Formula (2)
private def dcg(samples: Seq[T]): Double = // 3
samples.zipWithIndex.aggregate(0.0)( // 4
(_dcg, samplej) => _dcg+ dgci(samplej._2+1, ranking(samplej._1)) // 5
, _ + _)
}
The Ideal Discounted Cumulative Gain, iDCG is computed through the normalize method (# 1). iDCG (normalization factor) is computed by first sorting the items of type T by their value in decreasing order (# 2), then scoring this re-sorted list using the dcg method (# 3).
The computation of the Discounted Cumulative Gain by the method dici (# 5) is a direct application of the formula (1) described in the previous chapter. The cumulative gain in formula (2) is computed using the Scala aggregate method (# 4).
As mentioned earlier, the method dgci implements the formula (1). The logarithm function uses a base 2. It is computed as natural log(x)/natural log.
// Implement Formula (1)
def dgci(item: T, rank: Int): Double = Math.pow(2, item) - 1.0) / Math.log(rank + 1)
Note The DCG of subsequent samples can be computed using the same iDCG value from the same instance of NDCG.
Evaluation
Let's now consider a list of items of type Item defined as follows:
case class Item(id: String, x: Double, rank: Int)
The list of items, itemsList is implicitly ranked through the attribute, rank using a Scala implicit
val itemsList = Seq[Item](
Item("1", 3.4, 4),
Item("2", 1.4, 3),
Item("3", -0.7, 5),
Item("4", 5.2, 2),
Item("5", 1.4, 1)
)
implicit val ranking = (item: Item) => item.rank
Finally, let's compute the NDCG coefficient for the list of items, by invoking the score method.
val nDCG = new NDCG[Item](itemsList)
println(s"IDCG = ${nDCG.iDCG}") //45.64
println(s"Score = ${nDCG.score}") // 0.801
The ideal discounted cumulative gain, iDCG is 45.6: It is the optimum ranking for this list of time. The first sample score a probability of 0.8
References
- Wikipedia: Discounted Cumulative Gain
- Kaggle: Normalized Discounted Cumulative Gain
- github.com/patnicolas
---------------------------
Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning.
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3
He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning" Packt Publishing ISBN 978-1-78712-238-3