Acknowledgements

<p>This work was supported by an STSM Grant from the <a href='http://www.cost.eu/domains_actions/mpns/Actions/TD1210'>COST Action TD1210</a>. Many thanks to the colleagues who helped in one way or another during the course of this work (not implying any endorsement); in no particular order: <a href='http://bosamber.wordpress.com/'>Amber van den Bos</a> (Dakiroa), <a href='http://www.wirtschaft.bfh.ch/de/ueber_uns/kontakt/detailseite.html?tx_bfhpersonalpages_p=mam10&amp;tx_bfhpersonalpages_screen=data'>Michael Mosimann</a> (BFS), <a href='http://nl.linkedin.com/pub/anton-heijs/1/489/861'>Anton Heijs</a> (Treparel b.v.), <a href='http://en.wikipedia.org/wiki/Frank_van_Harmelen'>Frank van Harmelen</a> (VU Amsterdam).</p>


Anonymous Reviewer replied on 2014-10-14 01:48:33

<div typeof='oa:TextualBody' resource='#note-20141014014833' property='rdf:value' datatype='rdf:HTML'> <p>The basic idea in the paper is good. Determining the effectiveness of links in LOD is very important. I like the way the paper is written; the style is informal. The main conclusion, though negative, is important. In fact, too few papers advance our understanding by reporting negative results. This paper is a nice exception.</p> <p>I found the procedure, as written in the abstract, hard to follow. In addition, there were few concrete examples of the results of the procedure to illustrate the process. It seems to me an example illustrating each step of the process needs to be added. I expect this may add one or at most two to the length, certainly a tenable addition.</p> <p>The introductory joke at the start of the paper was a welcome surprise. Usually, authors sound so serious, and I found this style made me want to continue. Unfortunately, I found the presentation a little rough. In addition to the lack of examples cited above, some areas need a sentence or two more explanation. For instance, LSI may not be known to all readers. The reference is good, but if you say a sentence about what it does or accomplishes, and why it is relevant to the study, then that puts the unfamiliar reader at ease. (S)he can check the reference later; otherwise (s)he is forced to in order to continue reading. Similarly, LSD Sense is not defined. The reference indicates there is code on a GitHub site. You need to devote a paragraph to what LSD Sense is and does.</p> <p>Also relevant to the lack of examples, it is not at all clear how the elements in the Data Cube vocabulary are used in the experiment. Similarly, the World Bank list of indicators datasets is not spelled out or illustrated.</p> <p>Basically, you looked at dataset names. Give some examples!</p> <p>What are the reference areas, reference periods, and values you considered? These are left to the reader to guess.</p> <p>You need a full set of examples illustrating the problem, the procedures, and the results.</p></div>


Anonymous Reviewer replied on 2014-10-18 00:30:02

<div typeof='oa:TextualBody' resource='#note-20141018003002' property='rdf:value' datatype='rdf:HTML'> <p>The subject of the paper, exploring dataset titles for similarity of datasets, seems questionable. Looking at titles for comparison purposes is a very poor means; if this is done, it would only make sense if nothing else is available. But this was not mentioned in the paper.</p> <p>A dataset which claims to have some value has more information than just a title. It has a dataset description and much more descriptive information. Here, data in the form of data cubes are explored, so definitions of dimensions and code lists are available. Exploring these definitions and code lists has much more value for comparison purposes. The whole exercise of the paper should be applied to this area.</p> <p>Comparison is usually done mainly on the basis of similar or equal definitions of variables (microdata) or dimensions (aggregate data).</p> <p>Some details: statistical and semantic comparability are mentioned, but it is not clearly described what is meant by them in this context.</p> <p>Regarding the subject: I would just like to underline that a dataset which claims to have some value has more information than just a title, and this additional information should be used in any application.</p> <p>Using datasets (especially in comparison) just according to titles, without having information available like a description of the data or definitions of the columns/variables, is questionable. It doesn't seem to make sense, because the data is generated to describe something, but with poor information on the data one doesn't know what the data describes. That means one is speculating. Does this make sense? Comparison of data is a rather complicated animal; it is necessary to understand the data. This could be the basis for comparison.</p> <p>Comparison approaches should really focus more on descriptions/definitions of the data, or of the columns/variables which are the components of the data.</p> <p>Any work on just the titles focuses on something which can indeed be done, but it doesn't focus on something which really makes sense in a general way. This type of work could only make sense if one has no information available other than titles. Then this approach can be used to decipher the object which the data describes, but one has to be aware that even then the understanding is very limited.</p></div>


Anonymous Reviewer replied on 2014-10-18 00:30:27

<div typeof='oa:TextualBody' resource='#note-20141018003027' property='rdf:value' datatype='rdf:HTML'> <p>Semantic similarity and statistical correlation is an age-old problem, and it is an interesting idea to tackle it in the linked data context. But there are so many pitfalls in this area that you have to be very precise and careful. The problem here is that I simply cannot understand the formulation of the working hypotheses in 2.2. The notion of "semantic similarity between datasets" should be defined, because it makes no sense to me.</p> <p>The description of the workflow is too sketchy, and the results section is too short (and the figures are not legible).</p> <p>In short: interesting topic, but the paper is not finished. I will be happy to give a better evaluation when it is more complete.</p></div>



Armin Haller replied on 2014-10-18 00:31:28

<div typeof='oa:TextualBody' resource='#note-20141018003128' property='rdf:value' datatype='rdf:HTML'> <p>I think there are two main problems why the reviews weren't that great. The first is that the hypothesis is unclear/semantically wrong: "If the absence of semantically rich connection between datasets is inadequate". There is some double negation in there which I don't understand and which I think none of the reviewers understood (even though we can guess what you meant). The second problem was the lack of examples. Dataset titles is broad, and you did not provide a definition of what your titles are (are sub-datasets included?) and what you used: the labels, the URI? And why not a bit more than just the titles? It may have been more meaningful to extract some keywords from the title, the description, and especially the DTDs. Just the titles seems a bit restrictive. I know it is a workshop paper, but the reviewers probably thought from the beginning that based on the titles there wouldn't be much overlap/correlation.</p> <p>I really like the idea, though, and even though the last reviewer pointed out that semantic similarity and statistical correlation is an age-old problem, I have not seen a single paper doing that in our domain. What if you enlarged the correlation analysis with the DTDs? I think that would make it a lot more convincing and might even result in not rejecting your null hypothesis. Also, why not do it on more datasets? You probably don't have enough datasets in one domain, I guess? Happy to contribute to that. Also, including some of the ABS datasets might make it even better.</p></div>


Conclusions and Future Work

<p>We believe that the work presented here and the prior Linked Statistical Data Analysis effort contribute towards strengthening the relationship between Semantic Web and statistical research. We set out to minimize human involvement in discovering useful correlations in statistical data, and implemented a workflow that automates the analysis process, from data retrieval to outputting analysis results for candidate semantic linkages in Linked Statistical Data.</p> <p>We evaluated our results by testing the null hypothesis we put forward. While it turned out that the semantic similarity between dataset titles was not useful for determining strong and meaningful correlations (which is a useful finding, in any case), it left us with the remaining alternative hypotheses that can be used in future research.</p> <p resource='#future-work' rel='schema:hasPart' about=''><span property='schema:description' typeof='deo:FutureWork' about='#future-work' id='future-work'>Possibly fruitful future work might run a similar experiment with the semantic similarity of dataset descriptions, test manually configured useful relations for a controlled set of datasets, or look into interlinked topic domains across linked dataspaces.</span></p> <p>Where is <em>interestingness</em> hidden?</p>


Correlation analysis with variables semantic similarity and correlation of datasets

<p>We then took the absolute values of both variables, <code>|similarity|</code> and <code>|correlation|</code>, caring only about the strength of the relationships as opposed to their directionality. We then filtered out both similarity and correlation values &lt;0.05 and &gt;0.95, as well as correlation values with <em>p</em>-value &gt;0.05, in order to exclude potential outliers and misleadingly perfect relations, and to exclude insignificant correlations. The final correlation and scatter plot were generated by joining the similarity and correlation tables on the <code>datasetX</code> and <code>datasetY</code> columns. Finally, the correlation of the final data table was computed using the Kendall method, as the data had a non-normal distribution and we were not interested in modeling (line fitting).</p> <p>The second experiment followed the same procedure for the analysis, but considered only the datasets associated with the topic <em>education</em> for the same reference period.</p>
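The filtering and join steps above can be sketched as follows. This is a minimal Python illustration (the original analysis was done in R); the column names follow the text, but all table contents are made-up placeholder values.

```python
import pandas as pd
from scipy.stats import kendalltau

# Hypothetical per-pair tables; only the column names come from the text.
similarity = pd.DataFrame({
    "datasetX": ["A", "A", "B"],
    "datasetY": ["B", "C", "C"],
    "similarity": [0.42, -0.10, 0.80],
})
correlation = pd.DataFrame({
    "datasetX": ["A", "A", "B"],
    "datasetY": ["B", "C", "C"],
    "correlation": [0.55, 0.99, -0.30],
    "pValue": [0.01, 0.20, 0.03],
})

# Keep only the strength of each relationship, not its direction.
similarity["similarity"] = similarity["similarity"].abs()
correlation["correlation"] = correlation["correlation"].abs()

# Drop near-zero and near-perfect values, and insignificant correlations.
similarity = similarity[similarity["similarity"].between(0.05, 0.95)]
correlation = correlation[
    correlation["correlation"].between(0.05, 0.95)
    & (correlation["pValue"] <= 0.05)
]

# Join on the dataset-pair identifiers, then correlate the two variables
# with the Kendall method (no normality assumption).
joined = similarity.merge(correlation, on=["datasetX", "datasetY"])
if len(joined) >= 2:
    tau, p = kendalltau(joined["similarity"], joined["correlation"])
```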


Correlations for each dataset pair

<p>We retrieved the 2012 World Bank Indicators datasets, 3267 in total, via SPARQL queries from the <a href='http://worldbank.270a.info/'>World Bank Linked Dataspace</a> [<a href='#ref-8'>8</a>]. The correlations were computed using the statistical software R, by joining each dataset pair by their reference area (one of the dimensions of the dataset structure) and using their measure values for the correlation coefficient. Based on a preliminary inspection for normality on sample datasets, we noted that the observations did not come from a bivariate normal distribution. Hence, we computed Kendall's rank correlation coefficient in our analysis. Initially we computed and stored the correlations for dataset pairs with a sample size n&gt;10, resulting in 2,126,912 correlation values. The analysis output consisted of the following headers: <code>datasetX</code>, <code>datasetY</code>, <code>correlation</code>, <code>pValue</code>, <code>n</code>, where <code>datasetX</code> and <code>datasetY</code> are the identifiers of the dataset pair being compared. We later filtered out pairs with sample size n&lt;50 as our threshold for significance. The population size, i.e., the number of potential reference areas that can have an observation, is 260: the number of reference area codes in the World Bank classification. However, it is not known beforehand which reference areas may occur in a given dataset. We retained the majority of the computations in any case, giving us the possibility of better pruning in the future, in light of more information.</p>
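The per-pair computation can be sketched as follows. This is a Python illustration only (the actual analysis was done in R), and the two small tables stand in for a hypothetical pair of indicator datasets; the reference area codes and values are invented.

```python
import pandas as pd
from scipy.stats import kendalltau

# Two hypothetical indicator datasets, keyed by the reference area
# dimension, with one measure value per observation.
datasetX = pd.DataFrame({"refArea": ["AL", "AT", "BE", "BG"],
                         "value": [1.0, 2.0, 3.0, 4.0]})
datasetY = pd.DataFrame({"refArea": ["AT", "BE", "BG", "CH"],
                         "value": [5.0, 7.0, 6.0, 8.0]})

# Join the pair on reference area, keeping only shared observations.
pair = datasetX.merge(datasetY, on="refArea", suffixes=("X", "Y"))
n = len(pair)  # sample size; the paper discarded pairs with n <= 10
               # up front, and later filtered n < 50

# Kendall's rank correlation, since the observations did not come
# from a bivariate normal distribution.
correlation, p_value = kendalltau(pair["valueX"], pair["valueY"])
```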



Experiment

<p>Two experiments were conducted using the same workflow; they differed only in their input data. In the first experiment, the analysis was done for a particular reference year over all available datasets. In the second, we restricted the data further to a single dataset domain (topic), making it possible to test whether controlling for topic has a significant effect on the semantic similarity of the dataset titles.</p> <section resource='#data' rel='schema:hasPart' id='data'> <h3 property='schema:name'>Data</h3> <div typeof='deo:Data' resource='#data' property='schema:description' datatype='rdf:HTML'> <p>We conducted our experiment on a simple dataset structure with two dimensions, <em>reference area</em> and <em>reference period</em>, and one measure, <em>value</em>, for its observations; the <span rel='schema:hasPart' about=''><a href='http://worldbank.270a.info/dataset/world-bank-indicators' rel='rdfs:seeAlso' property='rdfs:label' typeof='deo:DatasetDescription' about='#data-worldbank-indicators' id='data-worldbank-indicators'>World Bank indicators</a></span> were a good candidate from the 270a Cloud. The rationale for using only one dataspace (at this time) was to remain within a consistent classification space when measuring semantic similarity. We fixed the reference period to 2012 and, for the topic-restricted experiment, used datasets that are part of <span rel='schema:hasPart' about=''><a href='http://worldbank.270a.info/classification/topic/4' rel='rdfs:seeAlso' property='rdfs:label' typeof='deo:DatasetDescription' about='#data-worldbank-indicators-topic-4' id='data-worldbank-indicators-topic-4'>World Bank's education topic</a></span>. We identified one downside concerning data quality: the attribute/unit information was incorporated into the dataset title, usually as a suffix within brackets. We dealt with this by removing the attribute information from the titles during preprocessing in the semantic similarity phase.</p> </div> </section> <section about='#experiment-workflow-worldbank' rel='schema:hasPart' id='experiment-workflow-worldbank'> <h3 property='schema:name'>World Bank Indicators workflow</h3> <div property='schema:description' datatype='rdf:HTML'> <p>The workflow of our experiment is summarized as follows:</p> <section resource='#experiment-workflow-worldbank-correlations' rel='schema:hasPart' id='experiment-workflow-worldbank-correlations'> <h4 property='schema:name'>Correlations for each dataset pair</h4> <div property='schema:description' datatype='rdf:HTML'> <p>We retrieved the 2012 World Bank Indicators datasets, 3267 in total, via SPARQL queries from the <a href='http://worldbank.270a.info/'>World Bank Linked Dataspace</a> [<a href='#ref-8'>8</a>]. The correlations were computed in the statistical software R by joining each dataset pair on their reference area (one of the dimensions of the dataset structure) and using their measure values for the correlation coefficient. A preliminary normality inspection on sample datasets showed that observations did not come from a bivariate normal distribution; hence, we computed Kendall's rank correlation coefficient in our analysis. Initially we computed and stored the correlations for dataset pairs with sample size n&gt;10, resulting in 2,126,912 correlation values. The generated analysis output consisted of the following headers: <code>datasetX</code>, <code>datasetY</code>, <code>correlation</code>, <code>pValue</code>, <code>n</code>, where <code>datasetX</code> and <code>datasetY</code> are the identifiers of the dataset pair being compared. We later filtered out pairs with sample size n&lt;50 as our threshold for significance. The population size, i.e., the number of potential reference areas that can have an observation, is 260: the number of reference area codes in the World Bank classification. However, it is not known beforehand which reference areas occur in a given dataset. We retained the majority of the computations in any case, leaving open the possibility of better pruning in the future, in light of more information.</p> </div> </section> <section resource='#experiment-workflow-worldbank-similarity' rel='schema:hasPart' id='experiment-workflow-worldbank-similarity'> <h4 property='schema:name'>Semantic similarity for each dataset pair</h4> <div property='schema:description' datatype='rdf:HTML'> <p>Before computing the semantic similarity, we first took a unique list of the dataset identifiers from <code>datasetX</code> and <code>datasetY</code>, so that similarity was checked only among those datasets rather than the complete set we originally retrieved. At this point, we had 2200 unique datasets. The similarity was measured on the dataset titles, which are short phrases, e.g., <q>Mortality rate, infant (per 1,000 live births)</q>. After minor preprocessing, e.g., removal of the bracketed unit text, this becomes <q>Mortality rate, infant</q>. The semantic similarity algorithm is based on LSA; essentially, LSA assigns each dataset title to one of a number of clusters (200 by default). The resulting output headers were: <code>datasetX</code>, <code>datasetY</code>, <code>similarity</code>.</p> </div> </section> <section resource='#experiment-workflow-worldbank-similarity-correlations' rel='schema:hasPart' id='experiment-workflow-worldbank-similarity-correlations'> <h4 property='schema:name'>Correlation analysis with the variables semantic similarity and dataset correlation</h4> <div property='schema:description' datatype='rdf:HTML'> <p>We then took the absolute values of both variables, <code>|similarity|</code> and <code>|correlation|</code>, caring only about the strength of the relationships rather than their direction. We then filtered out similarity and correlation values &lt;0.05 and &gt;0.95, as well as correlations with <em>p</em>-value&gt;0.05, in order to exclude potential outliers and misleadingly perfect relations, and to exclude insignificant correlations. The final correlation and scatter plot were generated by joining the similarity and correlation tables on the <code>datasetX</code> and <code>datasetY</code> columns. Finally, the correlation over the final data table was computed using the Kendall method, as the data had a non-normal distribution and we were not interested in modelling (line fitting).</p> <p>The second experiment followed the same procedure for the analysis, but considered only the datasets associated with the education topic for the same reference period.</p> </div> </section> </div> </section>
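The join-and-correlate step was carried out in R; as a rough sketch of the same logic, the pure-Python example below joins two toy datasets on their shared reference areas, enforces a minimum sample size, and computes Kendall's τ. The `kendall_tau` and `correlate_pair` helpers and the toy observations are invented for illustration and are not part of the actual workflow code.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs

def correlate_pair(dataset_x, dataset_y, min_n=10):
    """Join two {reference_area: value} datasets on their shared reference
    areas and correlate the measure values; None if the sample is too small."""
    areas = sorted(dataset_x.keys() & dataset_y.keys())
    if len(areas) <= min_n:
        return None
    xs = [dataset_x[a] for a in areas]
    ys = [dataset_y[a] for a in areas]
    return kendall_tau(xs, ys), len(areas)

# Toy observations over 12 reference areas; the second dataset is a
# monotone transform of the first, so the rank correlation is perfect.
gdp = {f"REF{i}": float(i) for i in range(12)}
energy = {f"REF{i}": 2.0 * i + 1.0 for i in range(12)}
print(correlate_pair(gdp, energy))  # (1.0, 12)
```

In the actual experiment this pairing runs over all dataset combinations, and the resulting rows (<code>datasetX</code>, <code>datasetY</code>, <code>correlation</code>, <code>pValue</code>, <code>n</code>) are stored for later filtering.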


Future work

Possibly fruitful future work includes running a similar experiment with the semantic similarity of dataset descriptions, testing manually configured useful relations for a controlled set of datasets, and looking into interlinked topic domains across linked dataspaces.


Hypothesis

<p>Given our research question, we propose a research hypothesis, followed by the null hypothesis that we investigate:</p> <p typeof='sio:SIO_000284' resource='#hypothesis-alternative' rel='schema:hasPart' id='hypothesis-alternative'><span property='rdfs:label'>H₁: If the absence of semantically rich connections between datasets is inadequate to distinguish meaningful relationships, then making relevant information about dataset connectivity available will improve the prediction of dataset correlations from their semantic similarity.</span></p> <p typeof='sio:SIO_000284' resource='#hypothesis-null' rel='schema:hasPart' id='hypothesis-null'><span property='rdfs:label'>H₀: There exists a significant relationship between the semantic similarity of statistical dataset titles and the correlation among those datasets, because dataset titles can indicate rich connectivity.</span></p> <span typeof='sio:SIO_000367' resource='#hypothesis-null' rel='schema:isPartOf' property='rdfs:label' content='semantic similarity' about='#variable-semantic-similarity-lsd-titles'></span><span typeof='sio:SIO_000367' resource='#hypothesis-null' rel='schema:isPartOf' property='rdfs:label' content='correlation' about='#variable-correlation-lsd'></span> <p>We set the significance level to 5%.</p>
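The 5% significance level can be applied to an observed Kendall τ with the usual large-sample normal approximation. The sketch below assumes no ties in the data (a simplification of what statistical packages such as R's `cor.test` handle more carefully); the function names are ours, for illustration only.

```python
import math

def kendall_p_value(tau, n):
    """Two-sided p-value for Kendall's tau under the null hypothesis of no
    association, using the large-sample normal approximation without ties."""
    # Under H0, Var(tau) ~= 2(2n + 5) / (9n(n - 1)).
    z = tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

def reject_null(tau, n, alpha=0.05):
    """Reject H0 at the chosen significance level (5% in our setup)."""
    return kendall_p_value(tau, n) < alpha
```

For example, with n = 50 joined reference areas, τ = 0.05 yields a p-value of about 0.61 (not significant), while τ = 0.9 is significant at the 5% level.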


Implementation

<p>We provide an implementation of the <a href='https://github.com/csarven/lsd-sense'>LSD Sense</a> workflow, which can be used both to reproduce our experiments and to run them on new input datasets. With the exception of determining which datasets to inspect and the system configuration, LSD Sense is automated.</p> <p id='semantic-correlation'><strong>Semantic Correlation</strong>: The semantic similarity algorithm is based on a <a href='http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf'>Latent Semantic Index</a> (LSI) [<a href='#ref-4'>4</a>]. We use the dataset titles to check for their similarity. Essentially, LSI assigns each dataset title to a cluster. The number of clusters can be adjusted (200 by default); what it should be remains an open research question, as research has demonstrated that optimal values depend on the size and nature of the dataset [<a href='#ref-5'>5</a>]. We use <a href='http://radimrehurek.com/gensim/index.html'>gensim</a> [<a href='#ref-6'>6</a>] in our <a href='https://github.com/albertmeronyo/SemanticCorrelation'>Semantic Correlation</a> [<a href='#ref-7'>7</a>] implementation for LSD Sense.</p> <p>Concerning the quality of the dataset titles, it is possible to come across datasets that differ by only one word, e.g., <q>male</q> vs. <q>female</q>, which potentially lowers the accuracy of differentiating between datasets. As mentioned earlier, we removed the attribute information from the dataset titles on the assumption that this reduces noise.</p>
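As a toy illustration of the preprocessing and pairwise title similarity, the sketch below strips the bracketed attribute/unit suffix and compares titles with a plain bag-of-words cosine. This is a deliberately simplified stand-in for the LSI projection, not the actual Semantic Correlation code (which builds a gensim LSI model); the function names are hypothetical.

```python
import math
import re
from collections import Counter

def strip_unit(title):
    """Remove the bracketed attribute/unit suffix from a dataset title,
    e.g. 'Mortality rate, infant (per 1,000 live births)'
      -> 'Mortality rate, infant'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", title)

def title_similarity(title_a, title_b):
    """Bag-of-words cosine similarity between two preprocessed titles
    (a simplified stand-in for the LSI-based similarity)."""
    a = Counter(strip_unit(title_a).lower().split())
    b = Counter(strip_unit(title_b).lower().split())
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(strip_unit("Mortality rate, infant (per 1,000 live births)"))
# Mortality rate, infant
```

Note how titles differing in a single token, e.g. <q>male</q> vs. <q>female</q>, still score highly under any bag-of-words measure, which is exactly the accuracy concern raised above.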


Introduction

There was this American who was afraid of a heart attack, and he found out that the Japanese eat very little fat and drink almost no wine, yet have far fewer heart attacks than the Americans. On the other hand, he also found out that the French eat as much fat as the Americans and drink much more wine, yet they too have fewer heart attacks. So he concluded that what kills you is speaking English [1]. While computers can assist us in discovering strong correlations in large amounts of statistical data, whether by chance or through sophisticated methods, humans (sometimes also known as domain experts) still need to be critical about the results and interpret them appropriately. This implies that we are still very much involved in the process of discovering meaningful correlations, filtering through everything that is presented to us. If, however, we could improve the situation slightly by having machines present us with only useful correlations out of a random mass of correlations, then we could give more of our attention to what is interesting. Hence, our goal is to set a path towards identifying why some variables have a semantic link between them. Before we establish that, our ongoing approach (as outlined in this research and afterwards) is to refute or cancel out factors that may masquerade as semantic similarity. We therefore set up our investigation as a workflow to experiment with Linked Statistical Datasets in the 270a Cloud [2]. Our working hypothesis is that semantically similar variables or datasets need to incorporate semantically rich information in order for thought-provoking correlations to be found. The question, then, is: what do exceptional or intriguing linkages for semantic similarity look like? We start with our null hypothesis by checking whether the dataset titles of World Bank indicators can help indicate strong correlations.
Our results show that dataset titles, by themselves or within a particular topic area, are not a good indicator for predicting correlations.

Name
enIntroduction

Linked Statistical Data and Semantic Similarity

<p>The RDF Data Cube vocabulary not only allows one to express statistical data in a Web-exchangeable format, but also to represent the (semantic) links within those statistical data. This ability poses some interesting new research questions about the relationship between the statistical and semantic relatedness of datasets. We are interested in the interplay between the statistical correlation of LSD and their semantic similarity, in order to answer questions like: does correlation between statistical datasets imply some kind of semantic relation? Do certain semantic links imply the existence of correlation? We propose a generic workflow for studying whether or not this relation between correlation and similarity holds for arbitrary LSD. We aim at generic correlation and similarity measures, and our workflow enables the use of any correlation and similarity indicators. For the specific goal of this paper, though, we stick to Kendall's correlation coefficient and Latent Semantic Analysis (LSA) similarity.</p>
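As a concrete illustration of the correlation side: Kendall's τ compares how often two series move in the same direction across all pairs of observations. The toy sketch below implements the tau-a variant (assuming no ties); our actual implementation relies on standard statistical libraries rather than this code:

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs (no-ties case)."""
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1   # both series move in the same direction
        elif s < 0:
            discordant += 1   # the series move in opposite directions
    return (concordant - discordant) / (n * (n - 1) / 2)

# Perfectly monotone series correlate at +1; a reversed ordering gives -1
print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```

Because it is rank-based, τ is insensitive to the units and scale of the two indicators, which is convenient when comparing heterogeneous statistical datasets.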

Name
enLinked Statistical Data and Semantic Similarity

Methodology

We first state our research design and hypothesis, then discuss how we employed Linked Statistical Data (LSD) and Semantic Similarity approaches for a workflow in our LSD Sense [3] implementation.

Research design

Research problem: Why do machines have difficulty in revealing meaningful correlations or in establishing non-coincidental connections between variables in statistical datasets? Put another way: how can machines uncover interesting correlations? Over this ongoing investigation, we want to uncover some of the fundamental components for measuring and declaring semantic similarity between datasets, in order to better predict relevant strong relationships. Can semantic relatedness between datasets imply statistical correlation of the related data points in those datasets?

Hypothesis

Given our research question, we propose a research hypothesis, followed by our investigation with the null hypothesis: H₁: If dataset connections currently lack the semantically rich information needed to distinguish meaningful relationships, then making relevant information about dataset connectivity available will improve the prediction of dataset correlations from their semantic similarity. H₀: There exists a significant relationship between the semantic similarity of statistical dataset titles and the correlation among those datasets, because dataset titles can indicate rich connectivity. We set the significance level to 5%.

Workflow

Based on preliminary experimentation, from data acquisition to analysis, we have created the LSD Sense workflow:

1. Create hypothesis.
2. Determine datasets and configurations.
3. Get metadata of datasets.
4. Get each dataset's observations.
5. Create correlations and other analyses for each dataset pair combination.
6. Create dataset metadata subset for semantic similarity.
7. Create semantic similarity for each dataset pair combination.
8. Create correlation and other analyses using the semantic similarity and the correlation of LSD variables.
9. Test and verify hypothesis.
10. Analysis.

The remaining parts of the methodology, Linked Statistical Data and Semantic Similarity and Implementation, are described in their own sections.
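To make the pairwise steps of the workflow concrete, here is a minimal sketch of aligning each pair of datasets on their shared reference area and reference period before correlating them. The indicator codes are World Bank-style codes, but the values are made up for illustration and are not our actual inputs:

```python
from itertools import combinations

# Illustrative observations keyed by (reference area, reference period);
# the values below are invented for this sketch
observations = {
    "SP.POP.TOTL":    {("CA", 2010): 34.0, ("CA", 2011): 34.3, ("DE", 2010): 81.8},
    "NY.GDP.MKTP.CD": {("CA", 2010): 1.6,  ("CA", 2011): 1.8,  ("DE", 2010): 3.4},
    "SE.ADT.LITR.ZS": {("CA", 2010): 99.0, ("DE", 2010): 99.0},
}

def paired_values(a, b):
    """Align two datasets on the (area, period) keys they share."""
    shared = sorted(set(a) & set(b))
    return [a[k] for k in shared], [b[k] for k in shared]

pairs = {}
for d1, d2 in combinations(sorted(observations), 2):
    x, y = paired_values(observations[d1], observations[d2])
    pairs[(d1, d2)] = (x, y)   # each aligned pair is ready for a correlation measure
```

Only observations present in both datasets enter the correlation, so datasets with little overlap in coverage contribute short (and therefore less reliable) paired series.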

HasPart
Hypothesis
Implementation
Linked Statistical Data and Semantic Similarity
research-design
Workflow
Name
enMethodology

Value
<p>On reading the paper, I see you tried to do on a large scale much what I did on a small scale in 2003, namely, to get some mileage out of titles only. I found a way to enhance the semantic content of these particular titles (browser bookmarks). I found the approach to be of value in retrieval. You might like to read my paper on this, "Browser bookmark management with Topic Maps", at</p> <p><a href='http://conferences.idealliance.org/extreme/html/2003/Passin01/EML2003Passin01.html'>http://conferences.idealliance.org/extreme/html/2003/Passin01/EML2003Passin01.html</a></p> <p>I also played with a very different title clustering method, and you might like reading that work, too:</p> <p>"On-the-fly Clustering As A Novel RDF Query Mechanism" at</p> <p>http://conferences.idealliance.org/extreme/html/2004/Passin01/EML2004Passin01.html</p> <p> Tom Passin</p>
Value
<p>Thanks for publishing your negative results; I wish more people did! This article also clearly detail your methodology and hypothesis, in addition to providing source code and data. </p> <p>One thing - intermediate values? (I am a provenance guy after all..) I browsed through the github source, but was unable to find the source data after cleaning/preprocessing - I wanted to verify how much remained after stripping out units etc.</p>
Value
<p>Thanks Stian. The source data was too big to include at GitHub. I could offer it up on this site I suppose.</p> <p>What I'm actually planning to do is simply referring to the analysis and provenance (activity) URIs instead at stats.270a.info.</p> <p>The source data can still be produced by running the scripts (hopefully nothing goes chaos on your end) :)</p>
Value
<p>The basic idea in the paper is good. Determining the effectiveness of links in LOD is very important. I like the way the paper is written; the style is informal. The main conclusion, though negative, is important. In fact, too few papers advance our understanding by reporting negative results. This paper is a nice exception.</p> <p>I found the procedure, written in the abstract, hard to follow. In addition, there were few concrete examples of the results of the procedure to illustrate the process. It seems to me an example illustrating each step of the process needs to be added. I expect this may add one or at most 2 to the length, certainly a tenable addition.</p> <p>The introductory joke at the start of the paper was a welcome surprise. Usually, authors sound so serious, and I found this style made me want to continue. Unfortunately, I found the presentation a little rough. In addition to the lack of example cited above, some areas need a sentence or two more explanation. For instance, LSI may not be known to all readers. The reference is good, but if you say a sentence about what it does or accomplishes, why it is relevant to the study, then that puts the unfamiliar reader at ease. (S)he can check the reference later. Otherwise (s)he is forced to in order to continue reading. Similarly, LSD Sense is not defined. The reference indicates there is code on a Github site. You need to devote a paragraph to what this (LSD Sense) is and does.</p> <p>Relevant also to the lack of examples, it is not at all clear how the elements in the Data Cube vocabulary are used in the experiment. Similarly, the World Bank list of indicators data sets is not spelled out or illustrated.</p> <p>Basically, you looked at data set names. Give some examples!</p> <p>What are the reference areas, reference periods, and values you considered? These are left to the reader to guess.</p> <p>You need to have a full set of examples illustrating the problem, the procedures, and the results.</p>
Value
<p>The subject of the paper, exploring dataset titles for similary of datasets, seem to be questionable. Looking at titles - for comparison purposes - is a very poor means. If this is done, then it would only make sense, if nothing else is available. But this was not mentioned in the paper.</p> <p>A dataset, which claims to have some value, has more information than just a title. It has a dataset description and much more descriptive information. Here, data in the form of data cubes are explored. Then definitions of dimensions and code lists are available. Exploring these definitions and codelists has much more value for comparison purposes. The whole exercise of the paper should be applied to this area.</p> <p>Comparison is usually mainly done on the basis of similar or equal definitions of variables (microdata) or dimensions(aggregate data).</p> <p>Some details: Statistical and semantic comparability are mentioned but not clearly described what it is meant in this context.</p> <p>Regarding the subject: I would like just to underline that a dataset, which claims to have some value, has more information than just a title. Then this additional information should be used for any usage.</p> <p>Using datasets (especially in comparison) just according to titles without having information available like a description of the data or definitions of the columns/variables is questionable. It doesn't seem to make sense because the data is generated to describe something. But with poor information on the data one doesn't know what the data describes. That means one does speculations. Does this make sense? Comparison of data is a rather complicated animal. It is necessary to understand the data. 
This could be the basis for comparison.</p> <p>Comparison approaches should really more focus on descriptions/definitions of the data respectively the columns/variables which are the compounds of the data.</p> <p>Any work on just the titles seems to focus on something which can be done indeed but it doesn't focus on something which really makes sense in a general way. This type of work could only make sense if one has no other information available than titles. Then this approach can be used to decipher the object of which the data describes. But one has to be aware that even then the understanding is very limited.</p>
Value
<p>Semantic similarity and statistical correlation is an age-old problem, and it is an interesting idea to tackle it in the linked data context. But there are so many pitfalls in this area that you have to be very precise and careful. The problem here is that I simply cannot understand the formulation of the work hypotheses in 2.2. The notion of "semantic similarity between datasets" should be defined, because it makes no sense to me.</p> <p>The description of the workflow is too sketchy, and the results section is too short (and the figures not legible.)</p> <p>In short: interesting topic, but the paper is not finished. I will be happy to give a better evaluation when it is more complete.</p>
Value
<p>I think there are two main problems, why the reviews weren't that great. The first is the Hypothesis is unclear/semantically wrong "If the absence of semantically rich connection between datasets is inadequate". There is some double negation in there which I don't understand and which I think non of the reviewers understood (even though we can guess what you meant). The second problem was the lack of examples. Dataset titles is broad and you did not provide a definition of what are your titles (sub-datasets are included?) and what did you use, the labels, the URI, and why not a bit more than just the titles? It may have been more meaningful to extract some keywords from the title, the description and especially the DTDs. Just on the titles seems a bit restrictive. I know it is a workshop paper, but the reviewers probably thought from the beginning that based on the titles there wouldn't be much overlap/correlation.</p> <p>I really like the idea, though, and even though the last reviewer pointed out that semantic similiartiy and statistical correlation is an age-old problem, I have not seen a single paper doing that in our domain. What if you enlarge the correlation analysis with the DTDs. I think that would make it a lot more convincing and might even result in not rejecting your null hypothesis. Also, why not doing it on more datasets? You probably don't have enough datasets in one domain, I guess? Happy to contribute to that. Also, including some of the ABS datasets might make it even better.</p>
Value
<p>I'm confused by your hypothesis test. You specify H0 as "There exists a significant relationship" etc. You did find a strong statistical relationship (p-value &lt; 2.2e-16). You then comment: "Given that both experiments resulted in p-values that are statistically significant and that the strength of the correlation values are weak, we reject our null hypothesis."</p> <p>I have never seen a hypothesis test like that. What would it have taken to not reject the hypothesis?</p> <p>Regarding your research question (as far as I understand it), I find it difficult to make sense of. Why should we care whether correlated datasets have similar titles? Sure, it wouldn't be surprising to find that male and female literacy rates are correlated, but then what? What non-obvious insight do we derive from that observation? On the other hand, if we want to causally explain those literacy rates and why they differ for different regions/demographics, we would surely have to investigate variables related to economic and social development and there is no reason whatsoever to expect that the relevant explanatory variables will be semantically similar to the variables we attempt to explain.</p>
Value
<p>Addendum: It is also bewildering that several commenters describe this as a "negative result". In standard statistical terminology, rejection of the null hypothesis is a positive (i. e. interesting, unexpected) result.</p>

Paul Groth


Paul Groth replied on 2014-08-08 14:44:34

<div typeof='oa:TextualBody' resource='#note-20140808144434' property='rdf:value' datatype='rdf:HTML'> <p>Sarven asked me to review this work. Here are my comments:</p> <p>Overall, an interesting analysis and direction. It’s also of compelling in terms of the underlying linked data approach to making the article itself. First some thoughts on the presentation of the research itself. I think in general the key issue I had is being more precise about the research goals.</p> <p>The key question I had was what constitutes interesting/useful correlations? This is central to the paper and I’m not clear what this means. </p> <ul><li>“The ongoing research problem is, hence, to investigate why machines have a difficulty in revealing meaningful correlations or establishing non-coincidental connection between variables in statistical datasets.” -&gt; How is this a research problem?</li><li>“refute or cancel out things which may be in disguise for semantic similarity.” - what does this mean?</li><li>Why don’t you state your research problem using the more simple version: How can machines uncover interesting correlations?</li><li>I don’t know what Hypothesis 1 means? Aren’t you just trying to say that your hypothesis is that “Semantic similarity is a good predictor of meaningful correlations”.</li><li>“Linked Statistical Data and Semantic Similarity” - is lucid and probably a better introduction to what you want to do than what came before it.</li></ul> <p> Thoughts on reproducibility</p> <ul><li>It would be nice to get some interactions on what to run.</li><li>Maybe some setup or pull scripts? 
Would love to just be a able to git clone and run… I’m a fan of vagrant or other lightweight VMs</li><li>What are your thoughts on supporting content negotiation?</li><li>Could you link from the workflow processes to the github code?</li></ul> <p>Stuff I can do with your paper which is cool:</p> <p>Assume:</p> <p><code>rapper -i rdfa https://csarven.ca/sense-of-lsd-analysis &gt; lsd.nt</code></p> <p>Find we what the problem statement is:</p> <p><code>roqet -i sparql -e "select * where {?s &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&gt; &lt;http://purl.org/spar/deo/ProblemStatement&gt;}" -D lsd.nt</code></p> <p>List the steps in the research:</p> <p><code>roqet -i sparql -e "select * where {?s &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#type&gt; &lt;http://www.opmw.org/ontology/WorkflowTemplateProcess&gt;}" -D lsd.nt</code></p> <p>List the contributors</p> <p><code>roqet -i sparql -e "select * where {?s &lt;http://purl.org/dc/terms/contributor&gt; ?o}" -D lsd.nt</code></p> <p>Unfortunately, they don’t all have rdfs:labels.</p> <p>Overall, it really shows what you can do with these technologies.</p> <p>Layout</p> <ul><li>Section numbering would be nice.</li> <li>Linking to sections. It would be nice to provide some winerlinks.</li><li>It would be nice if I could click on the figures to zoom in to them and also link directly back to the directory.</li><li>I would ensure that the datasets are identified and referenced (if possible) from the text.</li></ul> <p> Minor Comments</p> <ul><li>“to discover” —&gt; in discovering</li><li>“before hand” —&gt; “beforehand”</li><li>“The RDF Data Cube vocabulary does not only allow to express statistical data in a Web exchangeable format” -&gt; “not only allow one to…”</li></ul></div>

Actor
Paul Groth
AnnotatedAt
8 August 2014 16:44
AnnotatedBy
Paul Groth
Creator
Paul Groth
DatePublished
8 August 2014 16:44
HasBody
note-20140808144434
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enPaul Groth replied on 2014-08-08 14:44:34
Published
8 August 2014 16:44

prologue

There was this American who was afraid of a heart attack, and he found out that the Japanese eat very little fat and drink almost no wine, yet they have far fewer heart attacks than the Americans. On the other hand, he also found out that the French eat as much fat as the Americans and drink much more wine, yet they too have fewer heart attacks. So he concluded that what kills you is speaking English.


research-design

<p><strong>Research problem</strong>: Why do machines have difficulty in revealing meaningful correlations or establishing non-coincidental connection between variables in statistical datasets? Put another way: How can machines uncover interesting correlations?</p> <p>In this ongoing investigation, we want to uncover some of the fundamental components for measuring and declaring semantic similarity between datasets, in order to better predict relevant strong relationships. Can semantic relatedness between datasets imply statistical correlation of the related data points in the datasets?</p>
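The question can be made concrete with a small sketch: for every pair of datasets, compute both a semantic-similarity score over the titles and a statistical correlation over the aligned observations, then ask whether the two columns track each other. The following is a minimal illustration only — the dataset names and values are invented, and a crude word-overlap score stands in for a real semantic-similarity measure:

```python
import itertools

# Toy datasets: (title, observations aligned by reference area).
# Names and numbers are invented for illustration.
datasets = {
    'A': ('Mortality rate, infant', [34.1, 12.5, 5.0, 48.2]),
    'B': ('Mortality rate, adult',  [201.0, 98.3, 61.2, 250.7]),
    'C': ('Internet users',         [12.0, 78.5, 90.1, 8.3]),
}

def title_overlap(t1, t2):
    """Crude word-overlap (Jaccard) stand-in for a semantic-similarity score."""
    w1, w2 = set(t1.lower().split()), set(t2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def pearson(xs, ys):
    """Pearson correlation coefficient, written out for self-containment."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

# One row per dataset pair: (id1, id2, semantic similarity, correlation).
rows = [
    (a, b,
     title_overlap(datasets[a][0], datasets[b][0]),
     pearson(datasets[a][1], datasets[b][1]))
    for a, b in itertools.combinations(datasets, 2)
]
```

The research question then reduces to whether, across many such rows, the similarity column predicts the correlation column.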

Name
research-design

results

<p>All of the experiment results are available at the <a href='https://github.com/csarven/lsd-sense'>LSD Sense</a> GitHub repository, and can be reproduced. Table [<a href='#experiment-results'>Experiment results</a>] provides our findings, with Figures [<a href='#figure_lsd-sense-worldbank-2012'>1</a>] and [<a href='#figure_lsd-sense-worldbank-2012-topic-4'>2</a>]:</p> <table id='experiment-results'> <caption>Experiment Results</caption> <thead> <tr> <th></th> <th>All topics</th> <th>One topic (<em>education</em>)</th> </tr> </thead> <tfoot> <tr><td colspan='3'>Datasets are from 2012 World Bank indicators. n is the number of dataset pairs with semantic similarity and correlation as variables.</td></tr> </tfoot> <tbody> <tr><th>Correlation</th><td>0.182</td><td>0.227</td></tr> <tr><th><em>p</em>-value</th><td>&lt; 2.2e-16</td><td>&lt; 2.2e-16</td></tr> <tr><th>n</th><td>92819</td><td>33184</td></tr> </tbody> </table> <div class='figure-column-2'> <figure id='figure_lsd-sense-worldbank-2012'> <img alt='Figure of scatter plot showing 2012 World Bank indicators with all topics' height='300' width='300' src='/media/images/articles/lsd-sense-worldbank-2012.png'></img> <figcaption>2012 World Bank indicators with all topics.</figcaption> </figure> <figure id='figure_lsd-sense-worldbank-2012-topic-4'> <img alt='Figure of scatter plot showing 2012 World Bank indicators with topic education' height='300' width='300' src='/media/images/articles/lsd-sense-worldbank-2012-4.png'></img> <figcaption>2012 World Bank indicators with topic education.</figcaption> </figure> </div> <p>Given that both experiments resulted in <em>p</em>-values that are statistically significant and that the correlation values are weak, we reject our null hypothesis. For good measure, we can also verify the meaninglessness by looking at the plots. There is <strong>nothing</strong> interesting <strong>to see here</strong>. We will <strong>move along</strong> with our alternative hypothesis.</p>
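The pattern in the table — a vanishingly small p-value alongside a weak correlation — is typical of large samples and is easy to reproduce on synthetic data. The sketch below is not the experiment code: it generates invented, weakly related pairs and estimates significance with a simple permutation test instead of the analytic test used above.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, written out for self-containment."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

random.seed(42)
n = 5000

# Synthetic (similarity, correlation) pairs with a deliberately weak
# linear relationship: true r is about 0.2, as in the table above.
x = [random.random() for _ in range(n)]
y = [xi + random.gauss(0.0, 1.4) for xi in x]

r = pearson(x, y)  # weak, around 0.2

# Permutation test: how often does shuffled data beat the observed |r|?
trials = 200
exceed = 0
ys = y[:]
for _ in range(trials):
    random.shuffle(ys)
    if abs(pearson(x, ys)) >= abs(r):
        exceed += 1
p_approx = exceed / trials  # no shuffle comes close: "significant", yet r stays weak
```

With n in the tens of thousands, as in the experiments above, even a weak relationship passes a significance test easily, which is why the correlation strength rather than the p-value carries the verdict here.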


Sarven Capadisli replied on 2014-07-22 15:11:39

<div typeof='oa:TextualBody' resource='#note-20140722151139' property='rdf:value' datatype='rdf:HTML'> <p>Thanks Stian. The source data was too big to include at GitHub. I could offer it up on this site I suppose.</p> <p>What I'm actually planning to do is simply referring to the analysis and provenance (activity) URIs instead at stats.270a.info.</p> <p>The source data can still be produced by running the scripts (hopefully nothing goes chaos on your end) :)</p></div>

Actor
Sarven Capadisli
AnnotatedAt
22 July 2014 17:11
AnnotatedBy
Sarven Capadisli
Creator
Sarven Capadisli
DatePublished
22 July 2014 17:11
HasBody
note-20140722151139
HasTarget
Stian Soiland-Reyes replied on 2014-07-22 15:04:42
InReplyTo
Stian Soiland-Reyes replied on 2014-07-22 15:04:42
MotivatedBy
replying
Name
enSarven Capadisli replied on 2014-07-22 15:11:39
Published
22 July 2014 17:11
About
Correlation and dependence
Linked Data
Semantic similarity
Statistical database
Statistics
Abstract
en Statistical data is increasingly made available in the form of Linked Data on the Web. As more and more statistical datasets become available, a fundamental question on statistical data comparability arises: To what extent can arbitrary statistical datasets be faithfully compared? Besides a purely statistical comparability, we are interested in the role that semantics plays in the data to be compared. Our hypothesis is that semantic relationships between different components of statistical datasets might have a relationship with their statistical correlation. Our research focuses on studying whether these statistical and semantic relationships influence each other, by comparing the correlation of statistical data with their semantic similarity. The ongoing research problem is, hence, to investigate why machines have a difficulty in revealing meaningful correlations or establishing non-coincidental connection between variables in statistical datasets. We describe a fully reproducible pipeline to compare statistical correlation with semantic similarity in arbitrary Linked Statistical Data. We present a use case using World Bank data expressed as RDF Data Cube, and we highlight whether dataset titles can help predict strong correlations.
Author
Reinhard Riedl
Sarven Capadisli
Sören Auer
Citation
310
citation.cfm?id=1458105
hc6.pdf
http://270a.info/
http://orion.tw.rpi.edu/~jimmccusker/qb.js/
http://stats.270a.info/
http://tylervigen.com/
http://worldbank.270a.info/
index.html
ISWC2013 AZ LODSEM public.pdf
linked-statistical-data-analysis
lsa ind.pdf
lsd-sense
SemanticCorrelation
SurveyOMtkde SE.pdf
the-joy-stats-meaningless-and-meaningful-correlations
Creator
Sarven Capadisli
DateModified
21 March 2015 11:00
DatePublished
21 July 2014 11:00
HasPart

A2

Name
results
Acknowledgements
Conclusions and Future Work
Experiment
future-work
Introduction
Methodology
prologue

Related Work

Name
enRelated Work
related-work
results
World Bank's education topic
World Bank indicators
Has reply
Anonymous Reviewer replied on 2014-10-14 01:48:33
Anonymous Reviewer replied on 2014-10-18 00:30:02
Anonymous Reviewer replied on 2014-10-18 00:30:27
Armin Haller replied on 2014-10-18 00:31:28
Paul Groth replied on 2014-08-08 14:44:34
Stian Soiland-Reyes replied on 2014-07-22 15:04:42
TM replied on 2016-01-21 08:59:58
TM replied on 2016-01-21 12:35:44
Tom Passin replied on 2014-07-22 14:22:23
License
https://creativecommons.org/licenses/by/4.0/
Name
enSemantic Similarity and Correlation of Linked Statistical Data Analysis
Num replies
10
Publisher
Sarven Capadisli
Purpose
enMaking “sense” of Linked Statistical Data and Analysis.
RelatedLink
dokieli-rww
linked-data-notifications
linked-specifications-reports
sparqlines-sparql-to-sparkline
web-science-from-404-to-200

Semantic similarity for each dataset pair

<p>Before computing the semantic similarity, we first took a unique list of the dataset identifiers from <code>datasetX</code> and <code>datasetY</code>, so that similarity is checked only among those datasets, rather than the complete set of datasets we originally retrieved. At this point, we have 2200 unique datasets. The similarity was measured on dataset titles, which are short phrases, e.g., <q>Mortality rate, infant (per 1,000 live births)</q>. After minor preprocessing, e.g., removing the unit text within brackets, we were left with <q>Mortality rate, infant</q>. The semantic similarity algorithm is based on LSA; essentially, LSA assigns each dataset title to a cluster (the default number of clusters is 200). The resulting output headers were: <code>datasetX</code>, <code>datasetY</code>, <code>similarity</code>.</p>
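The preprocessing and pairwise-scoring steps can be sketched as follows. This is a simplified stand-in, not the actual pipeline: real LSA additionally factorizes the term-document matrix into latent topics, whereas the sketch below scores plain bag-of-words cosine similarity after stripping the bracketed unit text from two example titles.

```python
import math
import re
from collections import Counter

def tokens(title):
    # Drop bracketed unit text, e.g. "(per 1,000 live births)", then tokenize.
    title = re.sub(r'\([^)]*\)', '', title)
    return [t for t in re.split(r'[^a-z]+', title.lower()) if t]

def cosine(ta, tb):
    """Cosine similarity between two token lists as term-count vectors."""
    ca, cb = Counter(ta), Counter(tb)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

sim = cosine(tokens('Mortality rate, infant (per 1,000 live births)'),
             tokens('Mortality rate, adult, female (per 1,000 female adults)'))
```

Each pairwise score is then joined with the correlation output on the datasetX/datasetY identifiers to form the similarity variable of the analysis.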

Name
enSemantic similarity for each dataset pair
Image
3d3aa57fe25b38e3414c1764c993f3c3?size=48
Name
enStian Soiland-Reyes
Url
Stian Soiland-Reyes

Stian Soiland-Reyes replied on 2014-07-22 15:04:42

<div typeof='oa:TextualBody' resource='#note-20140722150442' property='rdf:value' datatype='rdf:HTML'> <p>Thanks for publishing your negative results; I wish more people did! This article also clearly details your methodology and hypothesis, in addition to providing source code and data. </p> <p>One thing - intermediate values? (I am a provenance guy after all..) I browsed through the github source, but was unable to find the source data after cleaning/preprocessing - I wanted to verify how much remained after stripping out units etc.</p></div>

Actor
Stian Soiland-Reyes
AnnotatedAt
22 July 2014 17:04
AnnotatedBy
Stian Soiland-Reyes
Creator
Stian Soiland-Reyes
DatePublished
22 July 2014 17:04
HasBody
note-20140722150442
Has reply
Sarven Capadisli replied on 2014-07-22 15:11:39
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enStian Soiland-Reyes replied on 2014-07-22 15:04:42
Num replies
1
Published
22 July 2014 17:04

Sören Auer

MemberOf
University of Bonn
Name
enSören Auer
Url
http://www.iai.uni-bonn.de/~auer/

TM

Image
87e0f265be9cb8361457f3cb20a9990e?size=48
Name
enTM
Url
TM

TM replied on 2016-01-21 08:59:58

<div typeof='oa:TextualBody' resource='#note-20160121085958' property='rdf:value' datatype='rdf:HTML'> <p>I'm confused by your hypothesis test. You specify H0 as "There exists a significant relationship" etc. You did find a strong statistical relationship (p-value &lt; 2.2e-16). You then comment: "Given that both experiments resulted in p-values that are statistically significant and that the strength of the correlation values are weak, we reject our null hypothesis."</p> <p>I have never seen a hypothesis test like that. What would it have taken to not reject the hypothesis?</p> <p>Regarding your research question (as far as I understand it), I find it difficult to make sense of. Why should we care whether correlated datasets have similar titles? Sure, it wouldn't be surprising to find that male and female literacy rates are correlated, but then what? What non-obvious insight do we derive from that observation? On the other hand, if we want to causally explain those literacy rates and why they differ for different regions/demographics, we would surely have to investigate variables related to economic and social development and there is no reason whatsoever to expect that the relevant explanatory variables will be semantically similar to the variables we attempt to explain.</p></div>

Actor
TM
AnnotatedAt
21 January 2016 10:59
AnnotatedBy
TM
Creator
TM
DatePublished
21 January 2016 10:59
HasBody
note-20160121085958
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enTM replied on 2016-01-21 08:59:58
Published
21 January 2016 10:59

TM replied on 2016-01-21 12:35:44

<div typeof='oa:TextualBody' resource='#note-20160121123544' property='rdf:value' datatype='rdf:HTML'> <p>Addendum: It is also bewildering that several commenters describe this as a "negative result". In standard statistical terminology, rejection of the null hypothesis is a positive (i. e. interesting, unexpected) result.</p></div>

Actor
TM
AnnotatedAt
21 January 2016 14:35
AnnotatedBy
TM
Creator
TM
DatePublished
21 January 2016 14:35
HasBody
note-20160121123544
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enTM replied on 2016-01-21 12:35:44
Published
21 January 2016 14:35
IsStepOfTemplate
Workflow
Label
enTest and verify hypothesis.

Tom Passin

Image
457f60484a1242d410374b7d5a7e3ebf?size=48
Name
enTom Passin

Tom Passin replied on 2014-07-22 14:22:23

<div typeof='oa:TextualBody' resource='#note-20140722142223' property='rdf:value' datatype='rdf:HTML'> <p>On reading the paper, I see you tried to do on a large scale much what I did on a small scale in 2003, namely, to get some mileage out of titles only. I found a way to enhance the semantic content of these particular titles (browser bookmarks). I found the approach to be of value in retrieval. You might like to read my paper on this, "Browser bookmark management with Topic Maps", at</p> <p><a href='http://conferences.idealliance.org/extreme/html/2003/Passin01/EML2003Passin01.html'>http://conferences.idealliance.org/extreme/html/2003/Passin01/EML2003Passin01.html</a></p> <p>I also played with a very different title clustering method, and you might like reading that work, too:</p> <p>"On-the-fly Clustering As A Novel RDF Query Mechanism" at</p> <p><a href='http://conferences.idealliance.org/extreme/html/2004/Passin01/EML2004Passin01.html'>http://conferences.idealliance.org/extreme/html/2004/Passin01/EML2004Passin01.html</a></p> <p> Tom Passin</p></div>

AnnotatedAt
22 July 2014 16:22
Creator
enTom Passin
DatePublished
22 July 2014 16:22
HasBody
note-20140722142223
HasTarget
Semantic Similarity and Correlation of Linked Statistical Data Analysis
InReplyTo
Semantic Similarity and Correlation of Linked Statistical Data Analysis
MotivatedBy
replying
Name
enTom Passin replied on 2014-07-22 14:22:23
Published
22 July 2014 16:22
Name
enUniversity of Bonn
Url
http://uni-bonn.de/
Name
enVU University Amsterdam
Url
http://vu.nl/

Workflow

<p>Based on preliminary experimentation from data acquisition to analysis, we have created the <span property='rdfs:label'>LSD Sense workflow</span>:</p> <ol> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-hypothesis' id='workflow-create-hypothesis'>Create the hypothesis.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-configure' id='workflow-configure'>Determine datasets and configurations.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-get-metadata-lsd' id='workflow-get-metadata-lsd'>Get the metadata of the datasets.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-get-observations-lsd' id='workflow-get-observations-lsd'>Get each dataset's observations.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-analysis-lsd' id='workflow-create-analysis-lsd'>Compute correlations and other analyses for each dataset pair combination.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-preprocess-semantic-similarity' id='workflow-create-preprocess-semantic-similarity'>Create a dataset metadata subset for semantic similarity.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-analysis-semantic-similarity' id='workflow-create-analysis-semantic-similarity'>Compute semantic similarity for each dataset pair combination.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' 
about='#workflow-create-analysis-semantic-similarity-correlation' id='workflow-create-analysis-semantic-similarity-correlation'>Compute the correlation and other analyses between the semantic similarity and LSD correlation variables.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-test-verify-hypothesis' id='workflow-test-verify-hypothesis'>Test and verify the hypothesis.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-analysis' id='workflow-analysis'>Analysis.</li> </ol>
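<p>The pairwise-analysis steps of the workflow above (steps 5, 7, and 8) can be sketched as follows. This is a minimal illustration only, not the LSD Sense implementation: the dataset values and semantic-similarity scores are invented placeholders, standing in for observations aligned by reference area and period (steps 3–4) and for the similarity scores of step 7.</p>

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented observations, aligned by reference area/period (steps 3-4).
datasets = {
    "gdp":       [1.0, 2.0, 3.0, 4.0],
    "schooling": [2.1, 4.2, 5.9, 8.1],
    "rainfall":  [5.0, 1.0, 4.0, 2.0],
}

# Invented semantic-similarity scores per dataset pair (step 7).
similarity = {
    ("gdp", "rainfall"):       0.1,
    ("gdp", "schooling"):      0.8,
    ("rainfall", "schooling"): 0.2,
}

# Step 5: correlation for each dataset pair combination.
lsd_corr = {pair: pearson(datasets[pair[0]], datasets[pair[1]])
            for pair in combinations(sorted(datasets), 2)}

# Step 8: correlate semantic similarity with the strength (absolute
# value) of the observed data correlation across all pairs.
pairs = sorted(similarity)
meta = pearson([similarity[p] for p in pairs],
               [abs(lsd_corr[p]) for p in pairs])
```

<p>In this toy data the semantically similar pair also happens to be the strongly correlated one, so <code>meta</code> comes out high; the point of the workflow is precisely to test whether that relationship holds on real LSD.</p>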

Description
<p>Based on preliminary experimentation from data acquisition to analysis, we have created the <span property='rdfs:label'>LSD Sense workflow</span>:</p> <ol> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-hypothesis' id='workflow-create-hypothesis'>Create the hypothesis.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-configure' id='workflow-configure'>Determine datasets and configurations.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-get-metadata-lsd' id='workflow-get-metadata-lsd'>Get the metadata of the datasets.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-get-observations-lsd' id='workflow-get-observations-lsd'>Get each dataset's observations.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-analysis-lsd' id='workflow-create-analysis-lsd'>Compute correlations and other analyses for each dataset pair combination.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-preprocess-semantic-similarity' id='workflow-create-preprocess-semantic-similarity'>Create a dataset metadata subset for semantic similarity.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-create-analysis-semantic-similarity' id='workflow-create-analysis-semantic-similarity'>Compute semantic similarity for each dataset pair combination.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' 
about='#workflow-create-analysis-semantic-similarity-correlation' id='workflow-create-analysis-semantic-similarity-correlation'>Compute the correlation and other analyses between the semantic similarity and LSD correlation variables.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-test-verify-hypothesis' id='workflow-test-verify-hypothesis'>Test and verify the hypothesis.</li> <li property='rdfs:label' resource='#workflow' rel='opmw:isStepOfTemplate' typeof='opmw:WorkflowTemplateProcess' about='#workflow-analysis' id='workflow-analysis'>Analysis.</li> </ol>
Label
enLSD Sense workflow
Name
enWorkflow
Label
enWorld Bank indicators
SeeAlso
world-bank-indicators
Label
enWorld Bank's education topic
SeeAlso
4