In an earlier blog post, we explained how we keep on adding sources and references to our database of generic trademarks.

We try to have as many different sources as possible. We constantly increase the number of newspaper references and external sources like books.

This week I analysed the quality and independence of the different sources.

Independent sources

Let us talk about whether the different sources we use are independent from each other.

I would like to avoid using a source that is just a copy-paste of another source. The best way to analyse this is with a correlation check.

The image below shows the correlation check I ran between the thirteen biggest sources we use on this website:

The maximum correlation is 0.6, meaning that no two sources contain identical content (two identical sources would have a correlation of 1). So no source is just a "copy-paste" of another source. Good news.
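For readers who want to try a similar check themselves, here is a minimal sketch. It assumes each source can be written as a list of 0/1 flags (1 = the source lists the term as a generic trademark), which is one plausible encoding and not necessarily the one used for the matrix above. The source names and data are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A source that lists every term (or none) has no defined correlation.
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one 0/1 flag per term, in the same term order for every source.
sources = {
    "source A": [1, 1, 0, 0, 1, 0],
    "source B": [1, 1, 0, 0, 1, 0],  # deliberately an exact copy of source A
    "source C": [0, 1, 1, 0, 0, 1],
}


def correlation_pairs(sources):
    """Correlation for every unordered pair of sources."""
    return {
        (a, b): pearson(sources[a], sources[b])
        for a, b in combinations(sources, 2)
    }


pairs = correlation_pairs(sources)
most_correlated = max(pairs, key=pairs.get)
print(most_correlated, pairs[most_correlated])
```

In this toy data the copied pair scores exactly 1, which is the pattern a copy-paste source would show up as.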

As you can see from the same table, "source D" appears to be the most independent source. It is not correlated with any of the other sources. I am working on increasing the number of references from that gem.
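One way to turn "most independent" into a number is to average the absolute correlation of each source with all the others and pick the lowest. The sketch below does that. The correlation values are invented for illustration and are not the real values from the matrix, and the scoring rule itself is an assumption, not necessarily how "source D" was identified here.

```python
from collections import defaultdict

# Hypothetical pairwise correlations (not the real values).
pairs = {
    ("source A", "source B"): -0.2,
    ("source A", "source C"): 0.6,
    ("source A", "source D"): 0.05,
    ("source B", "source C"): 0.3,
    ("source B", "source D"): -0.05,
    ("source C", "source D"): 0.0,
}


def mean_abs_correlation(pairs):
    """Average absolute correlation of each source with every other source."""
    per_source = defaultdict(list)
    for (a, b), r in pairs.items():
        per_source[a].append(abs(r))
        per_source[b].append(abs(r))
    return {name: sum(values) / len(values) for name, values in per_source.items()}


scores = mean_abs_correlation(pairs)
# The lower the score, the less the source moves together with the others.
most_independent = min(scores, key=scores.get)
print(most_independent, round(scores[most_independent], 3))
```

Taking the absolute value matters: a strongly negative correlation is also a form of dependence, so it should not cancel out a positive one.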

Quality of sources

Using the same correlation matrix, I noticed five instances of negative correlation, e.g. between source A and source B. This means that, in most cases, there are more differences between the two sources than there are similarities. This, of course, does not appear to be logical.
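To see what a negative value means in practice, it can help to count the four possible cases for a pair of sources. The sketch below again assumes 0/1 flags per term (an assumption about the encoding), for which the Pearson correlation equals the so-called phi coefficient. The two lists are made-up data, not the real source A and source B.

```python
from math import sqrt


def phi_with_counts(x, y):
    """Phi coefficient of two 0/1 lists, plus the underlying 2x2 counts."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # listed in both
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # only in first
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # only in second
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # listed in neither
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan"), (n11, n10, n01, n00)
    return (n11 * n00 - n10 * n01) / denom, (n11, n10, n01, n00)


# Hypothetical flags for eight terms.
first = [1, 1, 1, 0, 0, 0, 0, 1]
second = [0, 0, 1, 1, 1, 1, 0, 0]

phi, counts = phi_with_counts(first, second)
print(phi, counts)
```

With this encoding the value is negative whenever n10 * n01 is larger than n11 * n00, in other words when the terms listed by only one of the two sources outweigh the terms on which both agree.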

What's the reason for this negative correlation? Work in progress.