Published: March 29, 2016 by Marcel Blattner

Typical data mining algorithms make use of given features of the data to calculate similarities and detect patterns. Most algorithms come with many parameters for the user to tune. Generalising those algorithms is hard, since the optimal parameters for a given task are highly domain dependent. The Normalized Compression Distance (NCD), in contrast, is a parameter-free similarity measure with a wide range of possible applications [1].

In this blog post we show how we use the Normalized Compression Distance to discover similarities among text documents within a corpus. Firstly, we give a short introduction to the main idea of NCD. Secondly, we construct a graph (G) based on the calculated similarities, and thirdly we inspect G to find clusters. Finally, we calculate shortest paths between very distant documents in the graph and inspect what these traces can tell us about the ‘transformation’ from one document to another.

Normalized Compression Distance (NCD)

The notion of NCD is based on a concept from information theory known as Kolmogorov complexity [2]. Informally, Kolmogorov complexity can be summarised as *“an object is as complex as the length of its shortest description”*. For instance, the string **x** = “2398383029221” is more complex than the string **y** = “0000000000000”, because **y** can be described as “13 zeros”, while **x** seems to have no clear pattern and can only be described by the sequence itself. In that sense “0000000000000” has a lower Kolmogorov complexity than “2398383029221”. However, random-looking sequences can have low Kolmogorov complexity, which becomes apparent once the underlying pattern is found. As an example, take the sequence “31415926535897932384”: we can describe it as “the first 20 digits of π”. Sometimes patterns are difficult to detect. Given only the first 1000 consonants of *H.G. Wells’ Time Machine*, you would hardly be able to describe them as the beginning of that book. From this point of view, the true Kolmogorov complexity of a given sequence cannot be computed in general. The Normalized Compression Distance is an approximation to the Kolmogorov complexity [2].

An efficient compression algorithm reduces the length of a given input text as much as possible. If you generate a file with one million 1s, any reasonable compression algorithm will compress this sequence far better than a random sequence of one million digits. Since compression algorithms try to find the shortest possible sequence that describes the input data, the compressed length can be seen as a measure of complexity. We are not primarily interested in the compressed sequence itself, but rather in its length.
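This effect is easy to verify with the standard-library `lzma` module (the compressor we use later in this post); the two input sequences below are made-up examples:

```python
import lzma
import random

# A highly repetitive sequence vs. a random one of equal length.
repetitive = b"1" * 1_000_000
random.seed(42)
rand_digits = bytes(random.choice(b"0123456789") for _ in range(1_000_000))

len_rep = len(lzma.compress(repetitive))
len_rand = len(lzma.compress(rand_digits))

# The repetitive input compresses to a few hundred bytes; the random
# digits stay close to their information-theoretic size.
print(len_rep, len_rand)
```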

How can this complexity measure be used to estimate the distance between pairs of documents? Let’s say we want to calculate the distance between two text documents D1 and D2. We first compress D1 and D2 individually. Then we compress the concatenated version D1D2. If the size of the compressed concatenation equals the sum of the individually compressed sizes of D1 and D2, nothing that was compressed in the first document D1 helped to describe what was compressed in the second document D2; therefore D1 and D2 are highly dissimilar. If, on the other hand, you compress a concatenated version D1D1 or D2D2, you get roughly the same compressed length as for D1 or D2 alone, since D1D1 contains no more information than D1. In that sense a document is maximally similar to itself, which is intuitively clear. More precisely, the NCD is calculated as follows: NCD(x,y) = [C(xy) – min{C(x),C(y)}] / max{C(x),C(y)}, where C(s) denotes the compressed length of a document s under compressor C. For a well-behaved compressor, 0 <= NCD(x,y) <= 1 holds up to small deviations caused by compressor imperfections. bzip2, LZMA, and gzip are some of the best-known compression algorithms. Note: not every compressor is suited to calculate the NCD [3]. In this blog post we stick to the LZMA algorithm.
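The formula above can be sketched in a few lines of Python using the standard-library `lzma` module; the two sample documents are invented placeholders, not part of the original corpus:

```python
import lzma

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance from LZMA compressed lengths."""
    cx = len(lzma.compress(x))
    cy = len(lzma.compress(y))
    cxy = len(lzma.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

d1 = b"the quick brown fox jumps over the lazy dog " * 50
d2 = b"lorem ipsum dolor sit amet consectetur adipiscing " * 50

print(ncd(d1, d1))  # close to 0: a document is maximally similar to itself
print(ncd(d1, d2))  # noticeably larger: unrelated texts share little structure
```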

Similarity graph G

NCD(x,y) is a distance measure, which is easily transformed into a similarity measure by calculating 1 − NCD(x,y). For our experiment we had 869 documents at our disposal. These documents are articles published in the #12app and belong to different sections, e.g., economics, international news, national news, society etc. We calculated the pairwise NCD(x,y) using the LZMA compression algorithm and transformed the distances into similarities as described above. The resulting 869×869 matrix M describes the pairwise similarities of the documents within our corpus. We interpret every element m(x,y) of M as a weighted edge between document x and document y. This allows us to construct a graph G from M, where each node represents a document and each edge the similarity between the connected nodes. We kept only the six strongest edges for each node; the result is therefore a directed graph. Figure 1 shows the resulting graph. The label size represents the betweenness centrality [4]. The red path is the shortest path between two distant documents.
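The graph construction can be sketched as follows. This is a minimal sketch assuming the third-party `networkx` library; `similarity_graph` and the toy documents in the usage example are illustrative names, not the code used for the original 869-document corpus:

```python
import itertools
import lzma

import networkx as nx

def ncd(x, y):
    """Normalized Compression Distance based on LZMA compressed lengths."""
    cx, cy = len(lzma.compress(x)), len(lzma.compress(y))
    return (len(lzma.compress(x + y)) - min(cx, cy)) / max(cx, cy)

def similarity_graph(docs, k=6):
    """Directed graph keeping each node's k most similar neighbours.

    docs: mapping of document name -> document bytes.
    """
    sim = {}
    for a, b in itertools.combinations(docs, 2):
        s = 1.0 - ncd(docs[a], docs[b])  # similarity = 1 - distance
        sim[(a, b)] = sim[(b, a)] = s
    g = nx.DiGraph()
    g.add_nodes_from(docs)
    for a in docs:
        # Keep only the k strongest outgoing edges per node.
        best = sorted((n for n in docs if n != a),
                      key=lambda n: sim[(a, n)], reverse=True)[:k]
        for n in best:
            g.add_edge(a, n, weight=sim[(a, n)])
    return g
```

Because each node keeps its own top-k neighbours, edge (x, y) may exist while (y, x) does not, which is why the resulting graph is directed.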

Figure 1. Click to enlarge.

We found 8 main clusters [5]. Cluster 0: politics, Cluster 1: economics, Cluster 2: mixed national news, Cluster 3: society, Cluster 4: national news and society, Cluster 5: science, Cluster 6: international news, and Cluster 7: culture. We also found clear sub-clusters within the main clusters, e.g., the US election in cluster 6, the migrant crisis in cluster 0, medicine in cluster 5, and digital topics in cluster 3. We also inspected various shortest paths between nodes in the network. As an example, we show the shortest path (red) between a document in cluster 4 (A) and a document in cluster 3 (B). The starting article (A) mainly describes the relationship between a father and his kids. The shortest path goes through cluster 0 and first hits an article describing the tragedy of an asylum seeker and his children. The next document, also in cluster 0, covers the migrant crisis in general. The journey continues with a story about the Swiss justice system and an upcoming vote concerning migration policy. Finally arriving in cluster 3, the path hits documents on the relationship between the Swiss government and the European Union, and continues to a document related to e-mail and cryptography. The final document (B) contains a discussion on the usage of different web browsers. We see that the shortest path gradually transforms the article on a father and his kids into the discussion on web browsers. This shows that the NCD method, as simple as it is, can capture complex relationships between documents.
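Tracing such a path can be sketched with `networkx`'s shortest-path routine. The toy graph below is a hypothetical miniature of the red path (node names and similarity values are invented for illustration); similarities are converted back to distances so that “short” means “similar”:

```python
import networkx as nx

# Hypothetical similarity graph; the real G in this post has 869 nodes.
g = nx.DiGraph()
edges = [("A", "asylum", 0.9), ("asylum", "migration", 0.9),
         ("migration", "vote", 0.9), ("vote", "eu", 0.9),
         ("eu", "email", 0.9), ("email", "B", 0.9),
         ("A", "B", 0.05)]  # a direct edge exists, but it is very weak
for u, v, sim in edges:
    # Convert similarity back to a distance so Dijkstra prefers
    # chains of strongly similar documents over one weak direct edge.
    g.add_edge(u, v, distance=1.0 - sim)
    g.add_edge(v, u, distance=1.0 - sim)

path = nx.shortest_path(g, "A", "B", weight="distance")
print(path)  # → ['A', 'asylum', 'migration', 'vote', 'eu', 'email', 'B']
```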

We observe that the original sections were reproduced to some degree, and moreover we could identify consistent sub-clusters within the main groups.

NCD is a viable tool to group or cluster documents within a corpus. It is easy to calculate and needs no parameter tuning. A tool we will certainly use again in the future.

References:

[1] Vitányi, Paul MB, et al. “Normalized information distance.” *Information theory and statistical learning* . Springer US, 2009. 45-82.

[2] Li, Ming, et al. “The similarity metric.” *IEEE Transactions on Information Theory* 50.12 (2004): 3250-3264.

[3] Cebrián, Manuel, Manuel Alfonseca, and Alfonso Ortega. “Common pitfalls using the normalized compression distance: What to watch out for in a compressor.” *Communications in Information & Systems* 5.4 (2005): 367-384.

[4] https://en.wikipedia.org/wiki/Betweenness_centrality

[5] The clusters were extracted using the modularity measure. https://en.wikipedia.org/wiki/Modularity_(networks)
