Acknowledgements

<p>This work was supported by an STSM Grant from the <a href='http://www.cost.eu/domains_actions/mpns/Actions/TD1210'>COST Action TD1210</a>. Many thanks to the colleagues who helped in one way or another during the course of this work (not implying any endorsement); in no particular order: <a href='http://bosamber.wordpress.com/'>Amber van den Bos</a> (Dakiroa), <a href='http://www.wirtschaft.bfh.ch/de/ueber_uns/kontakt/detailseite.html?tx_bfhpersonalpages_p=mam10&amp;tx_bfhpersonalpages_screen=data'>Michael Mosimann</a> (BFS), <a href='http://nl.linkedin.com/pub/anton-heijs/1/489/861'>Anton Heijs</a> (Treparel b.v.), <a href='http://en.wikipedia.org/wiki/Frank_van_Harmelen'>Frank van Harmelen</a> (VU Amsterdam).</p>


Anonymous Reviewer replied on 2014-10-14 01:48:33

<div typeof='oa:TextualBody' resource='#note-20141014014833' property='rdf:value' datatype='rdf:HTML'> <p>The basic idea in the paper is good. Determining the effectiveness of links in LOD is very important. I like the way the paper is written; the style is informal. The main conclusion, though negative, is important. In fact, too few papers advance our understanding by reporting negative results. This paper is a nice exception.</p> <p>I found the procedure, as written in the abstract, hard to follow. In addition, there were few concrete examples of the results of the procedure to illustrate the process. It seems to me an example illustrating each step of the process needs to be added. I expect this may add one or at most two to the length, certainly a tenable addition.</p> <p>The introductory joke at the start of the paper was a welcome surprise. Usually, authors sound so serious, and I found this style made me want to continue. Unfortunately, I found the presentation a little rough. In addition to the lack of examples cited above, some areas need a sentence or two more explanation. For instance, LSI may not be known to all readers. The reference is good, but if you say a sentence about what it does or accomplishes, and why it is relevant to the study, then that puts the unfamiliar reader at ease. (S)he can check the reference later; otherwise (s)he is forced to in order to continue reading. Similarly, LSD Sense is not defined. The reference indicates there is code on a GitHub site. You need to devote a paragraph to what LSD Sense is and does.</p> <p>Also relevant to the lack of examples, it is not at all clear how the elements in the Data Cube vocabulary are used in the experiment. Similarly, the World Bank list of indicators datasets is not spelled out or illustrated.</p> <p>Basically, you looked at dataset names. Give some examples!</p> <p>What are the reference areas, reference periods, and values you considered? These are left to the reader to guess.</p> <p>You need a full set of examples illustrating the problem, the procedures, and the results.</p></div>


Anonymous Reviewer replied on 2014-10-18 00:30:02

<div typeof='oa:TextualBody' resource='#note-20141018003002' property='rdf:value' datatype='rdf:HTML'> <p>The subject of the paper, exploring dataset titles for similarity of datasets, seems questionable. Looking at titles for comparison purposes is a very poor means; if this is done, it would only make sense if nothing else is available. But this was not mentioned in the paper.</p> <p>A dataset which claims to have some value has more information than just a title. It has a dataset description and much more descriptive information. Here, data in the form of data cubes are explored, so definitions of dimensions and code lists are available. Exploring these definitions and code lists has much more value for comparison purposes. The whole exercise of the paper should be applied to this area.</p> <p>Comparison is usually done mainly on the basis of similar or equal definitions of variables (microdata) or dimensions (aggregate data).</p> <p>Some details: statistical and semantic comparability are mentioned, but it is not clearly described what is meant by them in this context.</p> <p>Regarding the subject: I would just like to underline that a dataset which claims to have some value has more information than just a title, and this additional information should be used in any application.</p> <p>Using datasets (especially in comparison) just according to titles, without having information available like a description of the data or definitions of the columns/variables, is questionable. It doesn't seem to make sense, because the data is generated to describe something, but with poor information on the data one doesn't know what the data describes. That means one is speculating. Does this make sense? Comparison of data is a rather complicated animal; it is necessary to understand the data. This could be the basis for comparison.</p> <p>Comparison approaches should really focus more on descriptions/definitions of the data, or of the columns/variables which are the components of the data.</p> <p>Any work on just the titles focuses on something which can indeed be done, but it doesn't focus on something which really makes sense in a general way. This type of work could only make sense if one has no information available other than titles. Then this approach can be used to decipher the object which the data describes, but one has to be aware that even then the understanding is very limited.</p></div>


Anonymous Reviewer replied on 2014-10-18 00:30:27

<div typeof='oa:TextualBody' resource='#note-20141018003027' property='rdf:value' datatype='rdf:HTML'> <p>Semantic similarity and statistical correlation is an age-old problem, and it is an interesting idea to tackle it in the linked data context. But there are so many pitfalls in this area that you have to be very precise and careful. The problem here is that I simply cannot understand the formulation of the working hypotheses in 2.2. The notion of "semantic similarity between datasets" should be defined, because it makes no sense to me.</p> <p>The description of the workflow is too sketchy, and the results section is too short (and the figures are not legible).</p> <p>In short: interesting topic, but the paper is not finished. I will be happy to give a better evaluation when it is more complete.</p></div>



Armin Haller replied on 2014-10-18 00:31:28

<div typeof='oa:TextualBody' resource='#note-20141018003128' property='rdf:value' datatype='rdf:HTML'> <p>I think there are two main problems why the reviews weren't that great. The first is that the hypothesis is unclear/semantically wrong: "If the absence of semantically rich connection between datasets is inadequate". There is some double negation in there which I don't understand and which I think none of the reviewers understood (even though we can guess what you meant). The second problem was the lack of examples. Dataset titles is broad, and you did not provide a definition of what your titles are (are sub-datasets included?) and what you used: the labels, the URI? And why not a bit more than just the titles? It may have been more meaningful to extract some keywords from the title, the description, and especially the DTDs. Just the titles seems a bit restrictive. I know it is a workshop paper, but the reviewers probably thought from the beginning that based on the titles there wouldn't be much overlap/correlation.</p> <p>I really like the idea, though, and even though the last reviewer pointed out that semantic similarity and statistical correlation is an age-old problem, I have not seen a single paper doing that in our domain. What if you enlarged the correlation analysis with the DTDs? I think that would make it a lot more convincing and might even result in not rejecting your null hypothesis. Also, why not do it on more datasets? You probably don't have enough datasets in one domain, I guess? Happy to contribute to that. Also, including some of the ABS datasets might make it even better.</p></div>


Conclusions and Future Work

<p>We believe that the work presented here and the prior Linked Statistical Data Analysis effort contribute towards strengthening the relationship between Semantic Web and statistical research. We set out to minimize human involvement in discovering useful correlations in statistical data, and implemented a workflow that automates the analysis process, from data retrieval to outputting analysis results for candidate semantic linkages in Linked Statistical Data.</p> <p>We evaluated our results by testing the null hypothesis we put forward. While it turned out that the semantic similarity between dataset titles was not useful for determining strong and meaningful correlations (which is a useful finding, in any case), it left us with the remaining alternative hypotheses that can be used in future research.</p> <p resource='#future-work' rel='schema:hasPart' about=''><span property='schema:description' typeof='deo:FutureWork' about='#future-work' id='future-work'>Possibly fruitful future work might run a similar experiment with the semantic similarity of dataset descriptions, test manually configured useful relations for a controlled set of datasets, or look into interlinked topic domains across linked dataspaces.</span></p> <p>Where is <em>interestingness</em> hidden?</p>


Correlation analysis with variables semantic similarity and correlation of datasets

<p>We then took the absolute values of both variables, <code>|similarity|</code> and <code>|correlation|</code>, caring only about the strength of the relationships as opposed to their directionality. We then filtered out both similarity and correlation values &lt;0.05 and &gt;0.95, as well as correlation values with <em>p</em>-value &gt;0.05, in order to exclude potential outliers and misleadingly perfect relations, and to exclude insignificant correlations. The final correlation and scatter plot were generated by joining the similarity and correlation tables on the <code>datasetX</code> and <code>datasetY</code> columns. Finally, the correlation of the final data table was computed using the Kendall method, as the data had a non-normal distribution and we were not interested in modeling (line fitting).</p> <p>The second experiment followed the same procedure for the analysis, but considered only the datasets associated with the topic <em>education</em> for the same reference period.</p>
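The filtering and join steps above can be sketched as follows. This is a minimal Python illustration (the original analysis was done in R); the column names follow the text, but all table contents are made-up placeholder values.

```python
import pandas as pd
from scipy.stats import kendalltau

# Hypothetical per-pair tables; only the column names come from the text.
similarity = pd.DataFrame({
    "datasetX": ["A", "A", "B"],
    "datasetY": ["B", "C", "C"],
    "similarity": [0.42, -0.10, 0.80],
})
correlation = pd.DataFrame({
    "datasetX": ["A", "A", "B"],
    "datasetY": ["B", "C", "C"],
    "correlation": [0.55, 0.99, -0.30],
    "pValue": [0.01, 0.20, 0.03],
})

# Keep only the strength of each relationship, not its direction.
similarity["similarity"] = similarity["similarity"].abs()
correlation["correlation"] = correlation["correlation"].abs()

# Drop near-zero and near-perfect values, and insignificant correlations.
similarity = similarity[similarity["similarity"].between(0.05, 0.95)]
correlation = correlation[
    correlation["correlation"].between(0.05, 0.95)
    & (correlation["pValue"] <= 0.05)
]

# Join on the dataset-pair identifiers, then correlate the two variables
# with the Kendall method (no normality assumption).
joined = similarity.merge(correlation, on=["datasetX", "datasetY"])
if len(joined) >= 2:
    tau, p = kendalltau(joined["similarity"], joined["correlation"])
```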


Correlations for each dataset pair

<p>We retrieved the 2012 World Bank Indicators datasets, 3267 in total, via SPARQL queries from the <a href='http://worldbank.270a.info/'>World Bank Linked Dataspace</a> [<a href='#ref-8'>8</a>]. The correlations were computed using the statistical software R, by joining each dataset pair by their reference area (one of the dimensions of the dataset structure) and using their measure values for the correlation coefficient. Based on a preliminary inspection for normality on sample datasets, we noted that the observations did not come from a bivariate normal distribution. Hence, we computed Kendall's rank correlation coefficient in our analysis. Initially we computed and stored the correlations for dataset pairs with a sample size n&gt;10, resulting in 2,126,912 correlation values. The analysis output consisted of the following headers: <code>datasetX</code>, <code>datasetY</code>, <code>correlation</code>, <code>pValue</code>, <code>n</code>, where <code>datasetX</code> and <code>datasetY</code> are the identifiers of the dataset pair being compared. We later filtered out pairs with sample size n&lt;50 as our threshold for significance. The population size, i.e., the number of potential reference areas that can have an observation, is 260: the number of reference area codes in the World Bank classification. However, it is not known beforehand which reference areas may occur in a given dataset. We retained the majority of the computations in any case, giving us the possibility of better pruning in the future, in light of more information.</p>
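The per-pair computation can be sketched as follows. This is a Python illustration only (the actual analysis was done in R), and the two small tables stand in for a hypothetical pair of indicator datasets; the reference area codes and values are invented.

```python
import pandas as pd
from scipy.stats import kendalltau

# Two hypothetical indicator datasets, keyed by the reference area
# dimension, with one measure value per observation.
datasetX = pd.DataFrame({"refArea": ["AL", "AT", "BE", "BG"],
                         "value": [1.0, 2.0, 3.0, 4.0]})
datasetY = pd.DataFrame({"refArea": ["AT", "BE", "BG", "CH"],
                         "value": [5.0, 7.0, 6.0, 8.0]})

# Join the pair on reference area, keeping only shared observations.
pair = datasetX.merge(datasetY, on="refArea", suffixes=("X", "Y"))
n = len(pair)  # sample size; the paper discarded pairs with n <= 10
               # up front, and later filtered n < 50

# Kendall's rank correlation, since the observations did not come
# from a bivariate normal distribution.
correlation, p_value = kendalltau(pair["valueX"], pair["valueY"])
```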



Experiment

<p>Two experiments were conducted using the same workflow; they differed only in their input data. In the first experiment, the analysis was done for a particular reference year over all available datasets. In the second, we restricted the data further to a single dataset domain (topic), making it possible to test whether controlling for topic has a significant effect on the semantic similarity of the dataset titles.</p> <section resource='#data' rel='schema:hasPart' id='data'> <h3 property='schema:name'>Data</h3> <div typeof='deo:Data' resource='#data' property='schema:description' datatype='rdf:HTML'> <p>We conducted our experiment on a simple dataset structure with two dimensions, <em>reference area</em> and <em>reference period</em>, and one measure, <em>value</em>, for its observations; the <span rel='schema:hasPart' about=''><a href='http://worldbank.270a.info/dataset/world-bank-indicators' rel='rdfs:seeAlso' property='rdfs:label' typeof='deo:DatasetDescription' about='#data-worldbank-indicators' id='data-worldbank-indicators'>World Bank indicators</a></span> were a good candidate from the 270a Cloud. The rationale for using only one dataspace (at this time) was to remain within a consistent classification space when measuring semantic similarity. We fixed the reference period to 2012 and, for the topic-restricted experiment, used datasets that are part of <span rel='schema:hasPart' about=''><a href='http://worldbank.270a.info/classification/topic/4' rel='rdfs:seeAlso' property='rdfs:label' typeof='deo:DatasetDescription' about='#data-worldbank-indicators-topic-4' id='data-worldbank-indicators-topic-4'>World Bank's education topic</a></span>. We identified one downside concerning data quality: the attribute/unit information was incorporated into the dataset title, usually as a suffix within brackets. We dealt with this by removing the attribute information from the titles during preprocessing in the semantic similarity phase.</p> </div> </section> <section about='#experiment-workflow-worldbank' rel='schema:hasPart' id='experiment-workflow-worldbank'> <h3 property='schema:name'>World Bank Indicators workflow</h3> <div property='schema:description' datatype='rdf:HTML'> <p>The workflow of our experiment is summarized as follows:</p> <section resource='#experiment-workflow-worldbank-correlations' rel='schema:hasPart' id='experiment-workflow-worldbank-correlations'> <h4 property='schema:name'>Correlations for each dataset pair</h4> <div property='schema:description' datatype='rdf:HTML'> <p>We retrieved the 2012 World Bank Indicators datasets, 3267 in total, via SPARQL queries from the <a href='http://worldbank.270a.info/'>World Bank Linked Dataspace</a> [<a href='#ref-8'>8</a>]. The correlations were computed in the statistical software R by joining each dataset pair on their reference area (one of the dimensions of the dataset structure) and using their measure values for the correlation coefficient. A preliminary normality inspection on sample datasets showed that observations did not come from a bivariate normal distribution; hence, we computed Kendall's rank correlation coefficient in our analysis. Initially we computed and stored the correlations for dataset pairs with sample size n&gt;10, resulting in 2,126,912 correlation values. The generated analysis output consisted of the following headers: <code>datasetX</code>, <code>datasetY</code>, <code>correlation</code>, <code>pValue</code>, <code>n</code>, where <code>datasetX</code> and <code>datasetY</code> are the identifiers of the dataset pair being compared. We later filtered out pairs with sample size n&lt;50 as our threshold for significance. The population size, i.e., the number of potential reference areas that can have an observation, is 260: the number of reference area codes in the World Bank classification. However, it is not known beforehand which reference areas occur in a given dataset. We retained the majority of the computations in any case, leaving open the possibility of better pruning in the future, in light of more information.</p> </div> </section> <section resource='#experiment-workflow-worldbank-similarity' rel='schema:hasPart' id='experiment-workflow-worldbank-similarity'> <h4 property='schema:name'>Semantic similarity for each dataset pair</h4> <div property='schema:description' datatype='rdf:HTML'> <p>Before computing the semantic similarity, we first took a unique list of the dataset identifiers from <code>datasetX</code> and <code>datasetY</code>, so that similarity was checked only among those datasets rather than the complete set we originally retrieved. At this point, we had 2200 unique datasets. The similarity was measured on the dataset titles, which are short phrases, e.g., <q>Mortality rate, infant (per 1,000 live births)</q>. After minor preprocessing, e.g., removal of the bracketed unit text, this becomes <q>Mortality rate, infant</q>. The semantic similarity algorithm is based on LSA; essentially, LSA assigns each dataset title to one of a number of clusters (200 by default). The resulting output headers were: <code>datasetX</code>, <code>datasetY</code>, <code>similarity</code>.</p> </div> </section> <section resource='#experiment-workflow-worldbank-similarity-correlations' rel='schema:hasPart' id='experiment-workflow-worldbank-similarity-correlations'> <h4 property='schema:name'>Correlation analysis with the variables semantic similarity and dataset correlation</h4> <div property='schema:description' datatype='rdf:HTML'> <p>We then took the absolute values of both variables, <code>|similarity|</code> and <code>|correlation|</code>, caring only about the strength of the relationships rather than their direction. We then filtered out similarity and correlation values &lt;0.05 and &gt;0.95, as well as correlations with <em>p</em>-value&gt;0.05, in order to exclude potential outliers and misleadingly perfect relations, and to exclude insignificant correlations. The final correlation and scatter plot were generated by joining the similarity and correlation tables on the <code>datasetX</code> and <code>datasetY</code> columns. Finally, the correlation over the final data table was computed using the Kendall method, as the data had a non-normal distribution and we were not interested in modelling (line fitting).</p> <p>The second experiment followed the same procedure for the analysis, but considered only the datasets associated with the education topic for the same reference period.</p> </div> </section> </div> </section>
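The join-and-correlate step was carried out in R; as a rough sketch of the same logic, the pure-Python example below joins two toy datasets on their shared reference areas, enforces a minimum sample size, and computes Kendall's τ. The `kendall_tau` and `correlate_pair` helpers and the toy observations are invented for illustration and are not part of the actual workflow code.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs

def correlate_pair(dataset_x, dataset_y, min_n=10):
    """Join two {reference_area: value} datasets on their shared reference
    areas and correlate the measure values; None if the sample is too small."""
    areas = sorted(dataset_x.keys() & dataset_y.keys())
    if len(areas) <= min_n:
        return None
    xs = [dataset_x[a] for a in areas]
    ys = [dataset_y[a] for a in areas]
    return kendall_tau(xs, ys), len(areas)

# Toy observations over 12 reference areas; the second dataset is a
# monotone transform of the first, so the rank correlation is perfect.
gdp = {f"REF{i}": float(i) for i in range(12)}
energy = {f"REF{i}": 2.0 * i + 1.0 for i in range(12)}
print(correlate_pair(gdp, energy))  # (1.0, 12)
```

In the actual experiment this pairing runs over all dataset combinations, and the resulting rows (<code>datasetX</code>, <code>datasetY</code>, <code>correlation</code>, <code>pValue</code>, <code>n</code>) are stored for later filtering.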


Future work

Possibly fruitful future work includes running a similar experiment with the semantic similarity of dataset descriptions, testing manually configured useful relations for a controlled set of datasets, and looking into interlinked topic domains across linked dataspaces.


Hypothesis

<p>Given our research question, we propose a research hypothesis, followed by the null hypothesis that we investigate:</p> <p typeof='sio:SIO_000284' resource='#hypothesis-alternative' rel='schema:hasPart' id='hypothesis-alternative'><span property='rdfs:label'>H₁: If the absence of semantically rich connections between datasets is inadequate to distinguish meaningful relationships, then making relevant information about dataset connectivity available will improve the prediction of dataset correlations from their semantic similarity.</span></p> <p typeof='sio:SIO_000284' resource='#hypothesis-null' rel='schema:hasPart' id='hypothesis-null'><span property='rdfs:label'>H₀: There exists a significant relationship between the semantic similarity of statistical dataset titles and the correlation among those datasets, because dataset titles can indicate rich connectivity.</span></p> <span typeof='sio:SIO_000367' resource='#hypothesis-null' rel='schema:isPartOf' property='rdfs:label' content='semantic similarity' about='#variable-semantic-similarity-lsd-titles'></span><span typeof='sio:SIO_000367' resource='#hypothesis-null' rel='schema:isPartOf' property='rdfs:label' content='correlation' about='#variable-correlation-lsd'></span> <p>We set the significance level to 5%.</p>
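The 5% significance level can be applied to an observed Kendall τ with the usual large-sample normal approximation. The sketch below assumes no ties in the data (a simplification of what statistical packages such as R's `cor.test` handle more carefully); the function names are ours, for illustration only.

```python
import math

def kendall_p_value(tau, n):
    """Two-sided p-value for Kendall's tau under the null hypothesis of no
    association, using the large-sample normal approximation without ties."""
    # Under H0, Var(tau) ~= 2(2n + 5) / (9n(n - 1)).
    z = tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

def reject_null(tau, n, alpha=0.05):
    """Reject H0 at the chosen significance level (5% in our setup)."""
    return kendall_p_value(tau, n) < alpha
```

For example, with n = 50 joined reference areas, τ = 0.05 yields a p-value of about 0.61 (not significant), while τ = 0.9 is significant at the 5% level.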


Implementation

<p>We provide an implementation of the <a href='https://github.com/csarven/lsd-sense'>LSD Sense</a> workflow, which can be used both to reproduce our experiments and to run them on new input datasets. With the exception of determining which datasets to inspect and the system configuration, LSD Sense is automated.</p> <p id='semantic-correlation'><strong>Semantic Correlation</strong>: The semantic similarity algorithm is based on a <a href='http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf'>Latent Semantic Index</a> (LSI) [<a href='#ref-4'>4</a>]. We use the dataset titles to check for their similarity. Essentially, LSI assigns each dataset title to a cluster. The number of clusters can be adjusted (200 by default); what it should be remains an open research question, as research has demonstrated that optimal values depend on the size and nature of the dataset [<a href='#ref-5'>5</a>]. We use <a href='http://radimrehurek.com/gensim/index.html'>gensim</a> [<a href='#ref-6'>6</a>] in our <a href='https://github.com/albertmeronyo/SemanticCorrelation'>Semantic Correlation</a> [<a href='#ref-7'>7</a>] implementation for LSD Sense.</p> <p>Concerning the quality of the dataset titles, it is possible to come across datasets that differ by only one word, e.g., <q>male</q> vs. <q>female</q>, which potentially lowers the accuracy of differentiating between datasets. As mentioned earlier, we removed the attribute information from the dataset titles on the assumption that this reduces noise.</p>
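As a toy illustration of the preprocessing and pairwise title similarity, the sketch below strips the bracketed attribute/unit suffix and compares titles with a plain bag-of-words cosine. This is a deliberately simplified stand-in for the LSI projection, not the actual Semantic Correlation code (which builds a gensim LSI model); the function names are hypothetical.

```python
import math
import re
from collections import Counter

def strip_unit(title):
    """Remove the bracketed attribute/unit suffix from a dataset title,
    e.g. 'Mortality rate, infant (per 1,000 live births)'
      -> 'Mortality rate, infant'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", title)

def title_similarity(title_a, title_b):
    """Bag-of-words cosine similarity between two preprocessed titles
    (a simplified stand-in for the LSI-based similarity)."""
    a = Counter(strip_unit(title_a).lower().split())
    b = Counter(strip_unit(title_b).lower().split())
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(strip_unit("Mortality rate, infant (per 1,000 live births)"))
# Mortality rate, infant
```

Note how titles differing in a single token, e.g. <q>male</q> vs. <q>female</q>, still score highly under any bag-of-words measure, which is exactly the accuracy concern raised above.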


Introduction

There was this American who was afraid of a heart attack, and he found out that the Japanese eat very little fat and drink almost no wine, yet have far fewer heart attacks than the Americans. On the other hand, he also found out that the French eat as much fat as the Americans and drink much more wine, yet they too have fewer heart attacks. So he concluded that what kills you is speaking English [1]. While computers can assist us in discovering strong correlations in large amounts of statistical data, whether by chance or through sophisticated methods, humans (sometimes also known as domain experts) still need to be critical about the results and interpret them appropriately. This implies that we are still very much involved in the process of discovering meaningful correlations, filtering through everything that is presented to us. If, however, we could improve the situation slightly by having machines present us with only useful correlations out of a random mass of correlations, then we could give more of our attention to what is interesting. Hence, our goal is to set a path towards identifying why some variables have a semantic link between them. Before we establish that, our ongoing approach (as outlined in this research and afterwards) is to refute or cancel out factors that may masquerade as semantic similarity. We therefore set up our investigation as a workflow to experiment with Linked Statistical Datasets in the 270a Cloud [2]. Our working hypothesis is that semantically similar variables or datasets need to incorporate semantically rich information in order for thought-provoking correlations to be found. The question, then, is: what do exceptional or intriguing linkages for semantic similarity look like? We start with our null hypothesis by checking whether the dataset titles of World Bank indicators can help indicate strong correlations.
Our results show that dataset titles, by themselves or within a particular topic area, are not a good indicator for predicting correlations.

Name
enIntroduction

Linked Statistical Data and Semantic Similarity

<p>The RDF Data Cube vocabulary not only allows one to express statistical data in a Web-exchangeable format, but also to represent the (semantic) links within those statistical data. This ability poses some interesting new research questions about the relationship between the statistical and semantic relatedness of datasets. We are interested in the interplay between the statistical correlation of LSD and their semantic similarity, in order to answer questions like: does correlation between statistical datasets imply some kind of semantic relation? Do certain semantic links imply the existence of correlation? We propose a generic workflow for studying whether or not this relation between correlation and similarity holds for arbitrary LSD. We aim at generic correlation and similarity measures, and our workflow enables the use of any correlation and similarity indicators. For the specific goal of this paper, though, we stick to Kendall's correlation coefficient and Latent Semantic Analysis (LSA) similarity.</p>
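As a concrete illustration of the correlation side: Kendall's τ compares how often two series move in the same direction across all pairs of observations. The toy sketch below implements the tau-a variant (assuming no ties); our actual implementation relies on standard statistical libraries rather than this code:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs (no-ties case)."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1   # both series move in the same direction
        elif s < 0:
            discordant += 1   # the series move in opposite directions
    return (concordant - discordant) / (n * (n - 1) / 2)

# Perfectly monotone series correlate at +1; a reversed ordering gives -1
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

Because it is rank-based, τ is insensitive to the units and scale of the two indicators, which is convenient when comparing heterogeneous statistical datasets.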

Name
enLinked Statistical Data and Semantic Similarity

Methodology

We first state our research design and hypothesis, then discuss how we employed Linked Statistical Data (LSD) and Semantic Similarity approaches for a workflow in our LSD Sense [3] implementation.

Research design

Research problem: Why do machines have difficulty in revealing meaningful correlations or in establishing non-coincidental connections between variables in statistical datasets? Put another way: how can machines uncover interesting correlations? Over this ongoing investigation, we want to uncover some of the fundamental components for measuring and declaring semantic similarity between datasets, in order to better predict relevant strong relationships. Can semantic relatedness between datasets imply statistical correlation of the related data points in those datasets?

Hypothesis

Given our research question, we propose a research hypothesis, followed by our investigation with the null hypothesis: H₁: If dataset connections currently lack the semantically rich information needed to distinguish meaningful relationships, then making relevant information about dataset connectivity available will improve the prediction of dataset correlations from their semantic similarity. H₀: There exists a significant relationship between the semantic similarity of statistical dataset titles and the correlation among those datasets, because dataset titles can indicate rich connectivity. We set the significance level to 5%.

Workflow

Based on preliminary experimentation, from data acquisition to analysis, we have created the LSD Sense workflow:

1. Create hypothesis.
2. Determine datasets and configurations.
3. Get metadata of datasets.
4. Get each dataset's observations.
5. Create correlations and other analyses for each dataset pair combination.
6. Create dataset metadata subset for semantic similarity.
7. Create semantic similarity for each dataset pair combination.
8. Create correlation and other analyses using the semantic similarity and the correlation of LSD variables.
9. Test and verify hypothesis.
10. Analysis.

The remaining parts of the methodology, Linked Statistical Data and Semantic Similarity and Implementation, are described in their own sections.
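To make the pairwise steps of the workflow concrete, here is a minimal sketch of aligning each pair of datasets on their shared reference area and reference period before correlating them. The indicator codes are World Bank-style codes, but the values are made up for illustration and are not our actual inputs:

```python
from itertools import combinations

# Illustrative observations keyed by (reference area, reference period);
# the values below are invented for this sketch
observations = {
    "SP.POP.TOTL":    {("CA", 2010): 34.0, ("CA", 2011): 34.3, ("DE", 2010): 81.8},
    "NY.GDP.MKTP.CD": {("CA", 2010): 1.6,  ("CA", 2011): 1.8,  ("DE", 2010): 3.4},
    "SE.ADT.LITR.ZS": {("CA", 2010): 99.0, ("DE", 2010): 99.0},
}

def paired_values(a, b):
    """Align two datasets on the (area, period) keys they share."""
    shared = sorted(set(a) & set(b))
    return [a[k] for k in shared], [b[k] for k in shared]

pairs = {}
for d1, d2 in combinations(sorted(observations), 2):
    x, y = paired_values(observations[d1], observations[d2])
    pairs[(d1, d2)] = (x, y)   # each aligned pair is ready for a correlation measure
```

Only observations present in both datasets enter the correlation, so datasets with little overlap in coverage contribute short (and therefore less reliable) paired series.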

HasPart
Hypothesis
Implementation
Linked Statistical Data and Semantic Similarity
research-design
Workflow
Name
enMethodology

Value
<p>On reading the paper, I see you tried to do on a large scale much what I did on a small scale in 2003, namely, to get some mileage out of titles only. I found a way to enhance the semantic content of these particular titles (browser bookmarks). I found the approach to be of value in retrieval. You might like to read my paper on this, "Browser bookmark management with Topic Maps", at</p> <p><a href='http://conferences.idealliance.org/extreme/html/2003/Passin01/EML2003Passin01.html'>http://conferences.idealliance.org/extreme/html/2003/Passin01/EML2003Passin01.html</a></p> <p>I also played with a very different title clustering method, and you might like reading that work, too:</p> <p>"On-the-fly Clustering As A Novel RDF Query Mechanism" at</p> <p>http://conferences.idealliance.org/extreme/html/2004/Passin01/EML2004Passin01.html</p> <p> Tom Passin</p>
Value
<p>Thanks for publishing your negative results; I wish more people did! This article also clearly detail your methodology and hypothesis, in addition to providing source code and data. </p> <p>One thing - intermediate values? (I am a provenance guy after all..) I browsed through the github source, but was unable to find the source data after cleaning/preprocessing - I wanted to verify how much remained after stripping out units etc.</p>
Value
<p>Thanks Stian. The source data was too big to include at GitHub. I could offer it up on this site I suppose.</p> <p>What I'm actually planning to do is simply referring to the analysis and provenance (activity) URIs instead at stats.270a.info.</p> <p>The source data can still be produced by running the scripts (hopefully nothing goes chaos on your end) :)</p>
Value
<p>The basic idea in the paper is good. Determining the effectiveness of links in LOD is very important. I like the way the paper is written; the style is informal. The main conclusion, though negative, is important. In fact, too few papers advance our understanding by reporting negative results. This paper is a nice exception.</p> <p>I found the procedure, written in the abstract, hard to follow. In addition, there were few concrete examples of the results of the procedure to illustrate the process. It seems to me an example illustrating each step of the process needs to be added. I expect this may add one or at most 2 to the length, certainly a tenable addition.</p> <p>The introductory joke at the start of the paper was a welcome surprise. Usually, authors sound so serious, and I found this style made me want to continue. Unfortunately, I found the presentation a little rough. In addition to the lack of example cited above, some areas need a sentence or two more explanation. For instance, LSI may not be known to all readers. The reference is good, but if you say a sentence about what it does or accomplishes, why it is relevant to the study, then that puts the unfamiliar reader at ease. (S)he can check the reference later. Otherwise (s)he is forced to in order to continue reading. Similarly, LSD Sense is not defined. The reference indicates there is code on a Github site. You need to devote a paragraph to what this (LSD Sense) is and does.</p> <p>Relevant also to the lack of examples, it is not at all clear how the elements in the Data Cube vocabulary are used in the experiment. Similarly, the World Bank list of indicators data sets is not spelled out or illustrated.</p> <p>Basically, you looked at data set names. Give some examples!</p> <p>What are the reference areas, reference periods, and values you considered? These are left to the reader to guess.</p> <p>You need to have a full set of examples illustrating the problem, the procedures, and the results.</p>
Value
<p>The subject of the paper, exploring dataset titles for similary of datasets, seem to be questionable. Looking at titles - for comparison purposes - is a very poor means. If this is done, then it would only make sense, if nothing else is available. But this was not mentioned in the paper.</p> <p>A dataset, which claims to have some value, has more information than just a title. It has a dataset description and much more descriptive information. Here, data in the form of data cubes are explored. Then definitions of dimensions and code lists are available. Exploring these definitions and codelists has much more value for comparison purposes. The whole exercise of the paper should be applied to this area.</p> <p>Comparison is usually mainly done on the basis of similar or equal definitions of variables (microdata) or dimensions(aggregate data).</p> <p>Some details: Statistical and semantic comparability are mentioned but not clearly described what it is meant in this context.</p> <p>Regarding the subject: I would like just to underline that a dataset, which claims to have some value, has more information than just a title. Then this additional information should be used for any usage.</p> <p>Using datasets (especially in comparison) just according to titles without having information available like a description of the data or definitions of the columns/variables is questionable. It doesn't seem to make sense because the data is generated to describe something. But with poor information on the data one doesn't know what the data describes. That means one does speculations. Does this make sense? Comparison of data is a rather complicated animal. It is necessary to understand the data. 
This could be the basis for comparison.</p> <p>Comparison approaches should really more focus on descriptions/definitions of the data respectively the columns/variables which are the compounds of the data.</p> <p>Any work on just the titles seems to focus on something which can be done indeed but it doesn't focus on something which really makes sense in a general way. This type of work could only make sense if one has no other information available than titles. Then this approach can be used to decipher the object of which the data describes. But one has to be aware that even then the understanding is very limited.</p>
Value
<p>Semantic similarity and statistical correlation is an age-old problem, and it is an interesting idea to tackle it in the linked data context. But there are so many pitfalls in this area that you have to be very precise and careful. The problem here is that I simply cannot understand the formulation of the work hypotheses in 2.2. The notion of "semantic similarity between datasets" should be defined, because it makes no sense to me.</p> <p>The description of the workflow is too sketchy, and the results section is too short (and the figures not legible.)</p> <p>In short: interesting topic, but the paper is not finished. I will be happy to give a better evaluation when it is more complete.</p>
Value
<p>I think there are two main problems, why the reviews weren't that great. The first is the Hypothesis is unclear/semantically wrong "If the absence of semantically rich connection between datasets is inadequate". There is some double negation in there which I don't understand and which I think non of the reviewers understood (even though we can guess what you meant). The second problem was the lack of examples. Dataset titles is broad and you did not provide a definition of what are your titles (sub-datasets are included?) and what did you use, the labels, the URI, and why not a bit more than just the titles? It may have been more meaningful to extract some keywords from the title, the description and especially the DTDs. Just on the titles seems a bit restrictive. I know it is a workshop paper, but the reviewers probably thought from the beginning that based on the titles there wouldn't be much overlap/correlation.</p> <p>I really like the idea, though, and even though the last reviewer pointed out that semantic similiartiy and statistical correlation is an age-old problem, I have not seen a single paper doing that in our domain. What if you enlarge the correlation analysis with the DTDs. I think that would make it a lot more convincing and might even result in not rejecting your null hypothesis. Also, why not doing it on more datasets? You probably don't have enough datasets in one domain, I guess? Happy to contribute to that. Also, including some of the ABS datasets might make it even better.</p>
Value
<p>I'm confused by your hypothesis test. You specify H0 as "There exists a significant relationship" etc. You did find a strong statistical relationship (p-value &lt; 2.2e-16). You then comment: "Given that both experiments resulted in p-values that are statistically significant and that the strength of the correlation values are weak, we reject our null hypothesis."</p> <p>I have never seen a hypothesis test like that. What would it have taken to not reject the hypothesis?</p> <p>Regarding your research question (as far as I understand it), I find it difficult to make sense of. Why should we care whether correlated datasets have similar titles? Sure, it wouldn't be surprising to find that male and female literacy rates are correlated, but then what? What non-obvious insight do we derive from that observation? On the other hand, if we want to causally explain those literacy rates and why they differ for different regions/demographics, we would surely have to investigate variables related to economic and social development and there is no reason whatsoever to expect that the relevant explanatory variables will be semantically similar to the variables we attempt to explain.</p>
Value
<p>Addendum: It is also bewildering that several commenters describe this as a "negative result". In standard statistical terminology, rejection of the null hypothesis is a positive (i. e. interesting, unexpected) result.</p>

Paul Groth


Paul Groth replied on 2014-08-08 14:44:34

<div typeof='oa:TextualBody' resource='#note-20140808144434' property='rdf:value' datatype='rdf:HTML'> <p>Sarven asked me to review this work. Here are my comments:</p> <p>Overall, an interesting analysis and direction. It’s also of compelling in terms of the underlying linked data approach to making the article itself. First some thoughts on the presentation of the research itself. I think in general the key issue I had is being more precise about the research goals.</p> <p>The key question I had was what constitutes interesting/useful correlations? This is central to the paper and I’m not clear what this means. </p> <ul><li>“The ongoing research problem is, hence, to investigate why machines have a difficulty in revealing meaningful correlations or establishing non-coincidental connection between variables in statistical datasets.” -&gt; How is this a research problem?</li><li>“refute or cancel out things which may be in disguise for semantic similarity.” - what does this mean?</li><li>Why don’t you state your research problem using the more simple version: How can machines uncover interesting correlations?</li><li>I don’t know what Hypothesis 1 means? Aren’t you just trying to say that your hypothesis is that “Semantic similarity is a good predictor of meaningful correlations”.</li><li>“Linked Statistical Data and Semantic Similarity” - is lucid and probably a better introduction to what you want to do than what came before it.</li></ul> <p> Thoughts on reproducibility</p> <ul><li>It would be nice to get some interactions on what to run.</li><li>Maybe some setup or pull scripts? 
Would love to just be a able to git clone and run… I’m a fan of vagrant or other lightweight VMs</li><li>What are your thoughts on supporting content negotiation?</li><li>Could you link from the workflow processes to the github code?</li></ul> <p>Stuff I can do with your paper which is cool:</p> <p>Assume:</p> <p><code>rapper -i rdfa https://csarven.ca/sense-of-lsd-analysis &gt; lsd.nt</code></p> <p>Find we what the problem statement is:</p> <p><code>roqet -i sparql -e "select * where {?s &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&gt; &lt;http://purl.org/spar/deo/ProblemStatement&gt;}" -D lsd.nt</code></p> <p>List the steps in the research:</p> <p><code>roqet -i sparql -e "select * where {?s &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&gt; &lt;http://www.opmw.org/ontology/WorkflowTemplateProcess&gt;}" -D lsd.nt</code></p> <p>List the contributors</p> <p><code>roqet -i sparql -e "select * where {?s &lt;http://purl.org/dc/terms/contributor&gt; ?o}" -D lsd.nt</code></p> <p>Unfortunately, they don’t all have rdfs:labels.</p> <p>Overall, it really shows what you can do with these technologies.</p> <p>Layout</p> <ul><li>Section numbering would be nice.</li> <li>Linking to sections. It would be nice to provide some winerlinks.</li><li>It would be nice if I could click on the figures to zoom in to them and also link directly back to the directory.</li><li>I would ensure that the datasets are identified and referenced (if possible) from the text.</li></ul> <p> Minor Comments</p> <ul><li>“to discover” —&gt; in discovering</li><li>“before hand” —&gt; “beforehand”</li><li>“The RDF Data Cube vocabulary does not only allow to express statistical data in a Web exchangeable format” -&gt; “not only allow one to…”</li></ul></div>

Actor
Paul Groth
AnnotatedAt
8 August 2014 16:44
AnnotatedBy
Paul Groth
Creator
Paul Groth
DatePublished
8 August 2014 16:44
HasBody
note-20140808144434
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enPaul Groth replied on 2014-08-08 14:44:34
Published
8 August 2014 16:44

prologue

There was this American who was afraid of a heart attack, and he found out that the Japanese eat very little fat and drink almost no wine, yet they have far fewer heart attacks than the Americans. On the other hand, he also found out that the French eat as much fat as the Americans and drink much more wine, yet they too have fewer heart attacks. So he concluded that what kills you is speaking English.


research-design

<p><strong>Research problem</strong>: Why do machines have difficulty in revealing meaningful correlations or establishing non-coincidental connection between variables in statistical datasets? Put another way: How can machines uncover interesting correlations?</p> <p>In this ongoing investigation, we want to uncover some of the fundamental components for measuring and declaring semantic similarity between datasets, in order to better predict relevant strong relationships. Can semantic relatedness between datasets imply statistical correlation of the related data points in the datasets?</p>
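The question can be made concrete with a small sketch: for every pair of datasets, compute both a semantic-similarity score over the titles and a statistical correlation over the aligned observations, then ask whether the two columns track each other. The following is a minimal illustration only — the dataset names and values are invented, and a crude word-overlap score stands in for a real semantic-similarity measure:

```python
import itertools

# Toy datasets: (title, observations aligned by reference area).
# Names and numbers are invented for illustration.
datasets = {
    'A': ('Mortality rate, infant', [34.1, 12.5, 5.0, 48.2]),
    'B': ('Mortality rate, adult',  [201.0, 98.3, 61.2, 250.7]),
    'C': ('Internet users',         [12.0, 78.5, 90.1, 8.3]),
}

def title_overlap(t1, t2):
    """Crude word-overlap (Jaccard) stand-in for a semantic-similarity score."""
    w1, w2 = set(t1.lower().split()), set(t2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def pearson(xs, ys):
    """Pearson correlation coefficient, written out for self-containment."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

# One row per dataset pair: (id1, id2, semantic similarity, correlation).
rows = [
    (a, b,
     title_overlap(datasets[a][0], datasets[b][0]),
     pearson(datasets[a][1], datasets[b][1]))
    for a, b in itertools.combinations(datasets, 2)
]
```

The research question then reduces to whether, across many such rows, the similarity column predicts the correlation column.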

Name
research-design

results

<p>All of the experiment results are available at the <a href='https://github.com/csarven/lsd-sense'>LSD Sense</a> GitHub repository, and can be reproduced. Table [<a href='#experiment-results'>Experiment results</a>] provides our findings, with Figures [<a href='#figure_lsd-sense-worldbank-2012'>1</a>] and [<a href='#figure_lsd-sense-worldbank-2012-topic-4'>2</a>]:</p> <table id='experiment-results'> <caption>Experiment Results</caption> <thead> <tr> <th></th> <th>All topics</th> <th>One topic (<em>education</em>)</th> </tr> </thead> <tfoot> <tr><td colspan='3'>Datasets are from 2012 World Bank indicators. n is the number of dataset pairs with semantic similarity and correlation as variables.</td></tr> </tfoot> <tbody> <tr><th>Correlation</th><td>0.182</td><td>0.227</td></tr> <tr><th><em>p</em>-value</th><td>&lt; 2.2e-16</td><td>&lt; 2.2e-16</td></tr> <tr><th>n</th><td>92819</td><td>33184</td></tr> </tbody> </table> <div class='figure-column-2'> <figure id='figure_lsd-sense-worldbank-2012'> <img alt='Figure of scatter plot showing 2012 World Bank indicators with all topics' height='300' width='300' src='/media/images/articles/lsd-sense-worldbank-2012.png'></img> <figcaption>2012 World Bank indicators with all topics.</figcaption> </figure> <figure id='figure_lsd-sense-worldbank-2012-topic-4'> <img alt='Figure of scatter plot showing 2012 World Bank indicators with topic education' height='300' width='300' src='/media/images/articles/lsd-sense-worldbank-2012-4.png'></img> <figcaption>2012 World Bank indicators with topic education.</figcaption> </figure> </div> <p>Given that both experiments resulted in <em>p</em>-values that are statistically significant and that the correlation values are weak, we reject our null hypothesis. For good measure, we can also verify the meaninglessness by looking at the plots. There is <strong>nothing</strong> interesting <strong>to see here</strong>. We will <strong>move along</strong> with our alternative hypothesis.</p>
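The pattern in the table — a vanishingly small p-value alongside a weak correlation — is typical of large samples and is easy to reproduce on synthetic data. The sketch below is not the experiment code: it generates invented, weakly related pairs and estimates significance with a simple permutation test instead of the analytic test used above.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, written out for self-containment."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

random.seed(42)
n = 5000

# Synthetic (similarity, correlation) pairs with a deliberately weak
# linear relationship: true r is about 0.2, as in the table above.
x = [random.random() for _ in range(n)]
y = [xi + random.gauss(0.0, 1.4) for xi in x]

r = pearson(x, y)  # weak, around 0.2

# Permutation test: how often does shuffled data beat the observed |r|?
trials = 200
exceed = 0
ys = y[:]
for _ in range(trials):
    random.shuffle(ys)
    if abs(pearson(x, ys)) >= abs(r):
        exceed += 1
p_approx = exceed / trials  # no shuffle comes close: "significant", yet r stays weak
```

With n in the tens of thousands, as in the experiments above, even a weak relationship passes a significance test easily, which is why the correlation strength rather than the p-value carries the verdict here.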


Sarven Capadisli replied on 2014-07-22 15:11:39

<div typeof='oa:TextualBody' resource='#note-20140722151139' property='rdf:value' datatype='rdf:HTML'> <p>Thanks Stian. The source data was too big to include at GitHub. I could offer it up on this site I suppose.</p> <p>What I'm actually planning to do is simply referring to the analysis and provenance (activity) URIs instead at stats.270a.info.</p> <p>The source data can still be produced by running the scripts (hopefully nothing goes chaos on your end) :)</p></div>

Actor
Sarven Capadisli
AnnotatedAt
22 July 2014 17:11
AnnotatedBy
Sarven Capadisli
Creator
Sarven Capadisli
DatePublished
22 July 2014 17:11
HasBody
note-20140722151139
HasTarget
Stian Soiland-Reyes replied on 2014-07-22 15:04:42
InReplyTo
Stian Soiland-Reyes replied on 2014-07-22 15:04:42
MotivatedBy
replying
Name
enSarven Capadisli replied on 2014-07-22 15:11:39
Published
22 July 2014 17:11
About
Correlation and dependence
Linked Data
Semantic similarity
Statistical database
Statistics
Abstract
en Statistical data is increasingly made available in the form of Linked Data on the Web. As more and more statistical datasets become available, a fundamental question on statistical data comparability arises: To what extent can arbitrary statistical datasets be faithfully compared? Besides a purely statistical comparability, we are interested in the role that semantics plays in the data to be compared. Our hypothesis is that semantic relationships between different components of statistical datasets might have a relationship with their statistical correlation. Our research focuses on studying whether these statistical and semantic relationships influence each other, by comparing the correlation of statistical data with their semantic similarity. The ongoing research problem is, hence, to investigate why machines have a difficulty in revealing meaningful correlations or establishing non-coincidental connection between variables in statistical datasets. We describe a fully reproducible pipeline to compare statistical correlation with semantic similarity in arbitrary Linked Statistical Data. We present a use case using World Bank data expressed as RDF Data Cube, and we highlight whether dataset titles can help predict strong correlations.
Author
Reinhard Riedl
Sarven Capadisli
Sören Auer
Citation
310
citation.cfm?id=1458105
hc6.pdf
http://270a.info/
http://orion.tw.rpi.edu/~jimmccusker/qb.js/
http://stats.270a.info/
http://tylervigen.com/
http://worldbank.270a.info/
index.html
ISWC2013 AZ LODSEM public.pdf
linked-statistical-data-analysis
lsa ind.pdf
lsd-sense
SemanticCorrelation
SurveyOMtkde SE.pdf
the-joy-stats-meaningless-and-meaningful-correlations
Creator
Sarven Capadisli
DateModified
21 March 2015 11:00
DatePublished
21 July 2014 11:00
HasPart

A2

Name
results
Acknowledgements
Conclusions and Future Work
Experiment
future-work
Introduction
Methodology
prologue

Related Work

Name
enRelated Work
related-work
results
World Bank's education topic
World Bank indicators
Has reply
Anonymous Reviewer replied on 2014-10-14 01:48:33
Anonymous Reviewer replied on 2014-10-18 00:30:02
Anonymous Reviewer replied on 2014-10-18 00:30:27
Armin Haller replied on 2014-10-18 00:31:28
Paul Groth replied on 2014-08-08 14:44:34
Stian Soiland-Reyes replied on 2014-07-22 15:04:42
TM replied on 2016-01-21 08:59:58
TM replied on 2016-01-21 12:35:44
Tom Passin replied on 2014-07-22 14:22:23
License
https://creativecommons.org/licenses/by/4.0/
Name
enSemantic Similarity and Correlation of Linked Statistical Data Analysis
Num replies
10
Publisher
Sarven Capadisli
Purpose
enMaking “sense” of Linked Statistical Data and Analysis.
RelatedLink
dokieli-rww
linked-data-notifications
linked-specifications-reports
sparqlines-sparql-to-sparkline
web-science-from-404-to-200

Semantic similarity for each dataset pair

<p>Before computing the semantic similarity, we first took a unique list of the dataset identifiers from <code>datasetX</code> and <code>datasetY</code>, so that similarity is checked only among those datasets, rather than the complete set of datasets we originally retrieved. At this point, we have 2200 unique datasets. The similarity was measured on dataset titles, which are short phrases, e.g., <q>Mortality rate, infant (per 1,000 live births)</q>. After minor preprocessing, e.g., removing the unit text within brackets, we were left with <q>Mortality rate, infant</q>. The semantic similarity algorithm is based on LSA; essentially, LSA assigns each dataset title to a cluster (the default number of clusters is 200). The resulting output headers were: <code>datasetX</code>, <code>datasetY</code>, <code>similarity</code>.</p>
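The preprocessing and pairwise-scoring steps can be sketched as follows. This is a simplified stand-in, not the actual pipeline: real LSA additionally factorizes the term-document matrix into latent topics, whereas the sketch below scores plain bag-of-words cosine similarity after stripping the bracketed unit text from two example titles.

```python
import math
import re
from collections import Counter

def tokens(title):
    # Drop bracketed unit text, e.g. "(per 1,000 live births)", then tokenize.
    title = re.sub(r'\([^)]*\)', '', title)
    return [t for t in re.split(r'[^a-z]+', title.lower()) if t]

def cosine(ta, tb):
    """Cosine similarity between two token lists as term-count vectors."""
    ca, cb = Counter(ta), Counter(tb)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

sim = cosine(tokens('Mortality rate, infant (per 1,000 live births)'),
             tokens('Mortality rate, adult, female (per 1,000 female adults)'))
```

Each pairwise score is then joined with the correlation output on the datasetX/datasetY identifiers to form the similarity variable of the analysis.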

Name
enSemantic similarity for each dataset pair
Image
3d3aa57fe25b38e3414c1764c993f3c3?size=48
Name
enStian Soiland-Reyes
Url
Stian Soiland-Reyes

Stian Soiland-Reyes replied on 2014-07-22 15:04:42

<div typeof='oa:TextualBody' resource='#note-20140722150442' property='rdf:value' datatype='rdf:HTML'> <p>Thanks for publishing your negative results; I wish more people did! This article also clearly details your methodology and hypothesis, in addition to providing source code and data. </p> <p>One thing - intermediate values? (I am a provenance guy after all..) I browsed through the github source, but was unable to find the source data after cleaning/preprocessing - I wanted to verify how much remained after stripping out units etc.</p></div>

Actor
Stian Soiland-Reyes
AnnotatedAt
22 July 2014 17:04
AnnotatedBy
Stian Soiland-Reyes
Creator
Stian Soiland-Reyes
DatePublished
22 July 2014 17:04
HasBody
note-20140722150442
Has reply
Sarven Capadisli replied on 2014-07-22 15:11:39
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enStian Soiland-Reyes replied on 2014-07-22 15:04:42
Num replies
1
Published
22 July 2014 17:04

Sören Auer

MemberOf
University of Bonn
Name
enSören Auer
Url
http://www.iai.uni-bonn.de/~auer/

TM

Image
87e0f265be9cb8361457f3cb20a9990e?size=48
Name
enTM
Url
TM

TM replied on 2016-01-21 08:59:58

<div typeof='oa:TextualBody' resource='#note-20160121085958' property='rdf:value' datatype='rdf:HTML'> <p>I'm confused by your hypothesis test. You specify H0 as "There exists a significant relationship" etc. You did find a strong statistical relationship (p-value &lt; 2.2e-16). You then comment: "Given that both experiments resulted in p-values that are statistically significant and that the strength of the correlation values are weak, we reject our null hypothesis."</p> <p>I have never seen a hypothesis test like that. What would it have taken to not reject the hypothesis?</p> <p>Regarding your research question (as far as I understand it), I find it difficult to make sense of. Why should we care whether correlated datasets have similar titles? Sure, it wouldn't be surprising to find that male and female literacy rates are correlated, but then what? What non-obvious insight do we derive from that observation? On the other hand, if we want to causally explain those literacy rates and why they differ for different regions/demographics, we would surely have to investigate variables related to economic and social development and there is no reason whatsoever to expect that the relevant explanatory variables will be semantically similar to the variables we attempt to explain.</p></div>

Actor
TM
AnnotatedAt
21 January 2016 10:59
AnnotatedBy
TM
Creator
TM
DatePublished
21 January 2016 10:59
HasBody
note-20160121085958
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enTM replied on 2016-01-21 08:59:58
Published
21 January 2016 10:59

TM replied on 2016-01-21 12:35:44

<div typeof='oa:TextualBody' resource='#note-20160121123544' property='rdf:value' datatype='rdf:HTML'> <p>Addendum: It is also bewildering that several commenters describe this as a "negative result". In standard statistical terminology, rejection of the null hypothesis is a positive (i. e. interesting, unexpected) result.</p></div>

Actor
TM
AnnotatedAt
21 January 2016 14:35
AnnotatedBy
TM
Creator
TM
DatePublished
21 January 2016 14:35
HasBody
note-20160121123544
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enTM replied on 2016-01-21 12:35:44
Published
21 January 2016 14:35
IsStepOfTemplate
Workflow
Label
enTest and verify hypothesis.

Tom Passin

Image
457f60484a1242d410374b7d5a7e3ebf?size=48
Name
enTom Passin

Tom Passin replied on 2014-07-22 14:22:23

<div typeof='oa:TextualBody' resource='#note-20140722142223' property='rdf:value' datatype='rdf:HTML'> <p>On reading the paper, I see you tried to do on a large scale much what I did on a small scale in 2003, namely, to get some mileage out of titles only. I found a way to enhance the semantic content of these particular titles (browser bookmarks). I found the approach to be of value in retrieval. You might like to read my paper on this, "Browser bookmark management with Topic Maps", at</p> <p><a href='http://conferences.idealliance.org/extreme/html/2003/Passin01/EML2003Passin01.html'>http://conferences.idealliance.org/extreme/html/2003/Passin01/EML2003Passin01.html</a></p> <p>I also played with a very different title clustering method, and you might like reading that work, too:</p> <p>"On-the-fly Clustering As A Novel RDF Query Mechanism" at</p> <p><a href='http://conferences.idealliance.org/extreme/html/2004/Passin01/EML2004Passin01.html'>http://conferences.idealliance.org/extreme/html/2004/Passin01/EML2004Passin01.html</a></p> <p> Tom Passin</p></div>

AnnotatedAt
22 July 2014 16:22
Creator
enTom Passin
DatePublished
22 July 2014 16:22
HasBody
note-20140722142223
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enTom Passin replied on 2014-07-22 14:22:23
Published
22 July 2014 16:22
Name
enUniversity of Bonn
Url
http://uni-bonn.de/
Name
enVU University Amsterdam
Url
http://vu.nl/

Workflow

<p>Based on preliminary experimentation from data acquisition to analysis, we have created the <span property='rdfs:label'>LSD Sense workflow</span>:</p> <ol> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-hypothesis' id='workflow-create-hypothesis'>Create the hypothesis.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-configure' id='workflow-configure'>Determine datasets and configurations.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-get-metadata-lsd' id='workflow-get-metadata-lsd'>Get the metadata of the datasets.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-get-observations-lsd' id='workflow-get-observations-lsd'>Get each dataset's observations.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-analysis-lsd' id='workflow-create-analysis-lsd'>Compute correlations and other analyses for each dataset pair combination.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-preprocess-semantic-similarity' id='workflow-create-preprocess-semantic-similarity'>Create a dataset metadata subset for semantic similarity.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-analysis-semantic-similarity' id='workflow-create-analysis-semantic-similarity'>Compute semantic similarity for each dataset pair combination.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' 
about='#workflow-create-analysis-semantic-similarity-correlation' id='workflow-create-analysis-semantic-similarity-correlation'>Compute the correlation and other analyses between the semantic similarity and LSD correlation variables.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-test-verify-hypothesis' id='workflow-test-verify-hypothesis'>Test and verify the hypothesis.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-analysis' id='workflow-analysis'>Analysis.</li> </ol>
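<p>The pairwise-analysis steps of the workflow above (steps 5, 7, and 8) can be sketched as follows. This is a minimal illustration only, not the LSD Sense implementation: the dataset values and semantic-similarity scores are invented placeholders, standing in for observations aligned by reference area and period (steps 3–4) and for the similarity scores of step 7.</p>

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented observations, aligned by reference area/period (steps 3-4).
datasets = {
    "gdp":       [1.0, 2.0, 3.0, 4.0],
    "schooling": [2.1, 4.2, 5.9, 8.1],
    "rainfall":  [5.0, 1.0, 4.0, 2.0],
}

# Invented semantic-similarity scores per dataset pair (step 7).
similarity = {
    ("gdp", "rainfall"):       0.1,
    ("gdp", "schooling"):      0.8,
    ("rainfall", "schooling"): 0.2,
}

# Step 5: correlation for each dataset pair combination.
lsd_corr = {pair: pearson(datasets[pair[0]], datasets[pair[1]])
            for pair in combinations(sorted(datasets), 2)}

# Step 8: correlate semantic similarity with the strength (absolute
# value) of the observed data correlation across all pairs.
pairs = sorted(similarity)
meta = pearson([similarity[p] for p in pairs],
               [abs(lsd_corr[p]) for p in pairs])
```

<p>In this toy data the semantically similar pair also happens to be the strongly correlated one, so <code>meta</code> comes out high; the point of the workflow is precisely to test whether that relationship holds on real LSD.</p>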

Description
<p>Based on preliminary experimentation from data acquisition to analysis, we have created the <span property='rdfs:label'>LSD Sense workflow</span>:</p> <ol> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-hypothesis' id='workflow-create-hypothesis'>Create the hypothesis.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-configure' id='workflow-configure'>Determine datasets and configurations.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-get-metadata-lsd' id='workflow-get-metadata-lsd'>Get the metadata of the datasets.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-get-observations-lsd' id='workflow-get-observations-lsd'>Get each dataset's observations.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-analysis-lsd' id='workflow-create-analysis-lsd'>Compute correlations and other analyses for each dataset pair combination.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-preprocess-semantic-similarity' id='workflow-create-preprocess-semantic-similarity'>Create a dataset metadata subset for semantic similarity.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-analysis-semantic-similarity' id='workflow-create-analysis-semantic-similarity'>Compute semantic similarity for each dataset pair combination.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' 
about='#workflow-create-analysis-semantic-similarity-correlation' id='workflow-create-analysis-semantic-similarity-correlation'>Compute the correlation and other analyses between the semantic similarity and LSD correlation variables.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-test-verify-hypothesis' id='workflow-test-verify-hypothesis'>Test and verify the hypothesis.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-analysis' id='workflow-analysis'>Analysis.</li> </ol>
Label
enLSD Sense workflow
Name
enWorkflow
Label
enWorld Bank indicators
SeeAlso
world-bank-indicators
Label
enWorld Bank's education topic
SeeAlso
4