SPARQL Tutorial for Data Exploration and Information Visualization

Miika Alonen and Tomi Kauppinen

Linked Open Data (LOD) is a promising approach and group of technologies for supporting the reuse and mashing up of data, whether it comes from one discipline or several. With this tutorial we aim to show that LOD is relatively easy to use for data exploration and information visualization.

The amount of available Linked Open Data and the number of SPARQL endpoints serving it are increasing. At the same time, however, documentation and statistics for the vocabularies used in the data are often lacking. Similarly, it is often hard to find out what a given dataset actually contains.

With this tutorial we hope to support the exploration of data provided by SPARQL endpoints, and to help you understand the contents and the vocabularies used to describe them. To follow this tutorial, you should have some previous knowledge of SPARQL and Linked Data. If you are new to Linked Data, sites like LinkedDataTools.org provide a good starting point. There are also some books that can help you learn SPARQL.

Also, if you are wondering where to find some data, the Data Hub catalogue will help you find useful datasets on the Internet. The Data Hub catalogue contains an amazing amount of data in different formats. One way to find interesting SPARQL endpoints is to look at the Mondeca uptime service, which monitors the availability of all SPARQL endpoints found in the Data Hub catalogue. The contents of the Data Hub can also be queried using SPARQL.

The setting for our tutorial is that the user does not have to know anything about the contents of a given endpoint before exploring it. The SPARQL queries we use in this tutorial are thus very generic. Once the user has a better idea of the data available in the endpoint, more precise queries can be designed.

Let us start by using the SPARQL visualization tool provided by the Linked Open Aalto Data Service to run the queries. The tutorial consists of the following set of handy SPARQL queries for exploring data. Each query is explained and accompanied by an illustration of the results. Moreover, you may run each query and see the results online yourself.

Find out all datasets (i.e. graphs) and calculate how many triples there are per graph:


SELECT DISTINCT ?g (count(?p) as ?triples) 
WHERE { GRAPH ?g { ?s ?p ?o } } 
GROUP BY ?g

(Run query  at the SPARQL endpoint data.aalto.fi/sparql)
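The same query can also be sent from a script. The following is a minimal Python sketch, assuming the endpoint accepts GET requests as described in the SPARQL 1.1 Protocol and can return the SPARQL JSON results format. The http:// scheme in the URL and the helper names are assumptions made for illustration, not part of the Aalto service documentation.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL: the tutorial names data.aalto.fi/sparql; the scheme is a guess.
ENDPOINT = "http://data.aalto.fi/sparql"

GRAPH_SIZES = """
SELECT ?g (COUNT(?p) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
"""


def build_request(endpoint, query):
    """Build a GET request that passes the query in the `query` parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the SPARQL JSON results format via content negotiation.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def bindings_to_rows(results):
    """Flatten a SPARQL JSON results document into {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(endpoint, query, timeout=30):
    """Send the query and return the flattened rows (needs network access)."""
    request = build_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bindings_to_rows(json.load(response))
```

Calling `run_query(ENDPOINT, GRAPH_SIZES)` would then return one dictionary per graph, with the values as strings, provided the endpoint is reachable and honours the Accept header.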

You could also visualize the results using the inline editor, for example by showing the number of triples per graph on a logarithmic scale.
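Outside the inline editor, a logarithmic scale amounts to taking the base-10 logarithm of each count. A minimal Python sketch, assuming the per-graph counts have already been fetched as rows of strings; the graph names and numbers below are made up for illustration and are not real counts from data.aalto.fi:

```python
import math


def log_scaled(rows):
    """Return (graph, triples, log10(triples)) tuples, largest graph first.

    Graphs with a zero count are skipped because log10(0) is undefined.
    """
    scaled = []
    for row in rows:
        triples = int(row["triples"])  # result values arrive as strings
        if triples > 0:
            scaled.append((row["g"], triples, math.log10(triples)))
    return sorted(scaled, key=lambda item: item[1], reverse=True)


# Hypothetical sample rows, for illustration only.
sample = [
    {"g": "http://example.org/graph/small", "triples": "120"},
    {"g": "http://example.org/graph/large", "triples": "1000000"},
]

for graph, triples, magnitude in log_scaled(sample):
    # One '#' per order of magnitude gives a crude text bar chart.
    print(f"{graph:35} {triples:>9} {'#' * round(magnitude)}")
```

On a log scale a graph with a few hundred triples stays visible next to one with millions, which a linear axis would flatten to nothing.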

Find out which classes are in use in the data found in a given SPARQL endpoint:


SELECT ?class (count(?s) as ?count)  
WHERE { ?s a ?class } 
GROUP BY ?class

(Run query  at the SPARQL endpoint data.aalto.fi/sparql)

The results of this query could be presented as a pie chart that visualizes the distribution of class usage.
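The slices of such a chart are simply each class's share of the total count. A minimal Python sketch of that arithmetic, assuming rows with `class` and `count` keys as the query above would produce; the sample values are hypothetical:

```python
def class_shares(rows):
    """Return (class, count, percent) tuples sorted by count, largest first."""
    counts = [(row["class"], int(row["count"])) for row in rows]
    total = sum(count for _, count in counts)
    if total == 0:
        return []  # nothing to chart
    shares = [(cls, count, 100.0 * count / total) for cls, count in counts]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Hypothetical sample rows, for illustration only.
sample = [
    {"class": "http://xmlns.com/foaf/0.1/Person", "count": "3"},
    {"class": "http://xmlns.com/foaf/0.1/Document", "count": "1"},
]

for cls, count, percent in class_shares(sample):
    print(f"{cls} {count} {percent:.1f}%")
```

Note that a resource with several types is counted once per type, so the shares describe type assertions rather than distinct resources.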

Now that we know which classes are used in the endpoint, we can investigate further which predicates are used with a certain class, for example foaf:Person:


SELECT ?p (count(?p) as ?count) 
WHERE { [a <http://xmlns.com/foaf/0.1/Person>] ?p ?o } 
GROUP BY ?p

(Run query at the SPARQL endpoint data.aalto.fi/sparql)

The results of this query show the actual usage of properties in the data. For example, a visualization created from this query reveals that even though foaf:Persons make up 64.7% of all the triples, only around 1000 instances have more than two properties.

Querying the whole endpoint is usually not the most efficient or meaningful thing to do. It is also a good idea to minimize the computational complexity of SPARQL queries. This can be done by querying specific graphs or by adding a LIMIT clause to the queries. So here is the class query again, this time restricted to the graph that contains only the linked data generated from the people profiles:


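# FROM makes this graph the default graph for the pattern below;
# FROM NAMED alone would also need a GRAPH { ... } pattern.
SELECT ?class (count(?s) as ?count)
FROM <http://data.aalto.fi/id/people/>
WHERE { ?s a ?class }
GROUP BY ?class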

(Run query at the SPARQL endpoint data.aalto.fi/sparql)


The same restriction works for the predicates used in that graph:


SELECT ?p (count(?p) as ?count)
FROM <http://data.aalto.fi/id/people/>
WHERE { ?s ?p ?o }
GROUP BY ?p

(Run query at the SPARQL endpoint data.aalto.fi/sparql)

This is the data we are interested in, and writing further queries is much easier once we know which classes and properties are actually used in the dataset.
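Graph-scoped variants of the class query, optionally capped with LIMIT, can be generated rather than typed by hand. The helper below is a hypothetical Python sketch, not part of any library. It uses a FROM clause, which makes the named graph the default graph for the pattern, and adds ORDER BY so that a LIMIT keeps the most frequent classes.

```python
def class_count_query(graph=None, limit=None):
    """Build the class-usage query, optionally scoped to a graph and capped."""
    lines = ["SELECT ?class (COUNT(?s) AS ?count)"]
    if graph is not None:
        # Restrict matching to a single graph.
        lines.append(f"FROM <{graph}>")
    lines.append("WHERE { ?s a ?class }")
    lines.append("GROUP BY ?class")
    # Sort so that a LIMIT keeps the most used classes.
    lines.append("ORDER BY DESC(?count)")
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


print(class_count_query("http://data.aalto.fi/id/people/", limit=10))
```

Keep in mind that LIMIT caps the number of rows returned. With GROUP BY the endpoint still generally has to scan the matching triples, so restricting the graph is likely to be the more effective of the two measures.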

 
