Analyzing and Visualizing Linked Data from the Aalto University with R

by Tomi Kauppinen and Miika Alonen

University data is of interest in many ways. Think of data about course schedules, seminar rooms and buildings, publication details, and the scientists and teaching personnel themselves. What if we could analyze all of this content together? Now that some universities publish their data as Linked Data, we can.

For example, explicit connections between a teacher and her courses, publications, and physical location allow for quite a range of applications, such as visualizing the data on timelines, maps, and charts. Another aspect is statistical analysis of the data to find interesting correlations between various phenomena.

In this tutorial we do exactly this. We take data from Linked Open Aalto (data.aalto.fi) and analyze it with R, using the SPARQL package. As an example, we use course data described with the TEACH vocabulary and other relevant vocabularies.

So, open up R, install the SPARQL package, and follow the instructions below.

Let us start by defining the necessary packages, endpoints and prefixes.

library(SPARQL)

endpoint <- "http://data.aalto.fi/sparql"
options <- "output=xml"

prefix <- c("teach","http://linkedscience.org/teach/ns#",
"geo","http://www.geonames.org/ontology#")
sparql_prefix <- "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX teach: <http://linkedscience.org/teach/ns#>
PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
PREFIX ical: <http://www.w3.org/2002/12/cal/icaltzd#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/> "

Next, we define a query for getting all of the departments:

q <- paste(sparql_prefix,
           "SELECT DISTINCT ?department ?name ?org where {
              ?department aiiso:code ?code ; foaf:name ?name ; aiiso:teaches ?course ;
              aiiso:part_of [ foaf:name ?org ] . ?course teach:arrangedAt [ical:dtstart ?start]
              FILTER(lang(?name)='en' && lang(?org)='en')
            } GROUP BY ?org ?department ?name ORDER BY desc(?org) ?name")

This query returns the URIs and labels of the departments. To get the results of the query, call the SPARQL function with the parameters specified above.

res <- SPARQL(endpoint,q,ns=prefix,extra=options)$results
res$department

The URIs of the departments will be listed as below:

[1] "<http://data.aalto.fi/id/courses/noppa/dept_T3040>"  "<http://data.aalto.fi/id/courses/noppa/dept_T3010>"
 [3] "<http://data.aalto.fi/id/courses/noppa/dept_T3050>"  "<http://data.aalto.fi/id/courses/noppa/dept_T3070>"
 [7] "<http://data.aalto.fi/id/courses/noppa/dept_T3030>"  "<http://data.aalto.fi/id/courses/noppa/dept_T3090>" ...
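
The returned URIs are wrapped in angle brackets. That is convenient for pasting them back into a query, but less so for labels or list keys. A minimal sketch of stripping the brackets with base R follows; the helper name strip_brackets is our own invention (it is not part of the SPARQL package), and the sample vector simply repeats two URIs from the output above.

```r
# Hypothetical helper (not part of the SPARQL package): remove a leading "<"
# and a trailing ">" from each URI string.
strip_brackets <- function(uris) {
  gsub("^<|>$", "", uris)
}

# Two URIs copied from the output above, used here as sample input.
uris <- c("<http://data.aalto.fi/id/courses/noppa/dept_T3040>",
          "<http://data.aalto.fi/id/courses/noppa/dept_T3010>")

bare <- strip_brackets(uris)
print(bare)
```

The bracketed form is still the one to use when a URI is inserted into a SPARQL query as a subject.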

Now we can use these URIs to run more specific queries and to compare the departments. Let's define a loop that goes through all of the departments and queries the courses taught by each of them.

## Collect the per-department results in a list keyed by department URI
resList <- list()

for(i in 1:length(res$department)) {

queryStart <- "SELECT (xsd:string(?month) as ?month) (count(?course) as ?lectures) WHERE { "
queryEnd <- " aiiso:teaches ?course . ?course teach:arrangedAt [ical:dtstart ?start] . } GROUP BY (substr(str(?start),6,2) as ?month) ORDER BY ?month"
q <- paste(sparql_prefix,queryStart,res$department[i],queryEnd)
response <- SPARQL(endpoint,q,ns=prefix,extra=options)$results

start = as.numeric("01")
full <- seq(start, by=1, length=12)
partialMedia <- data.frame(date=as.numeric(response[,1]),value=response[,2])
with(partialMedia, value[match(full, date)])
dMedia <- data.frame(Date=full, value=with(partialMedia, value[match(full, date)]))
dMedia[is.na(dMedia)] <- 0

resList[[res$department[i]]] = dMedia

plot(dMedia,type="b",xlab="month",main=res$name[i],ylab="number of lectures",pch=16,col="purple")
lines(dMedia,type="b",pch=16,col="red")
}
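
The month-alignment step inside the loop can be looked at in isolation: a department may have no lectures in some months, so the query result is matched against a full 1 to 12 axis and the gaps are set to zero. The sketch below reproduces that step without querying the endpoint. The function name fill_months and the sample counts are made up for illustration.

```r
# Hypothetical helper: align sparse monthly counts to a full 12-month axis.
# `months` are month strings such as "01"; `counts` are the matching counts.
fill_months <- function(months, counts) {
  full <- 1:12
  # match() gives, for each month 1..12, its position in `months` (or NA)
  value <- counts[match(full, as.numeric(months))]
  value[is.na(value)] <- 0   # months without lectures get a count of 0
  data.frame(Date = full, value = value)
}

# Invented sample response: lectures only in January, February and September.
demo <- fill_months(c("01", "02", "09"), c(12, 30, 7))
print(demo)
```

Here the January, February and September rows keep their counts and the other nine months are 0, which is the shape that the plot and the later correlation expect.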

Now we can see line plots of the lecture counts for each of the Aalto departments. In the example below we illustrate the aggregated lecture counts per month for two departments at Aalto University: the Department of Art (on the left) and the Department of Media Technology (on the right):

The results are also stored in a list: type resList to get all of the results, and resList["<URI>"] to get a single department by its URI. You can use this information for further analysis, for example to explore the correlations between departments:

cor(resList$'<http://data.aalto.fi/id/courses/noppa/dept_T3030>'$value,resList$'<http://data.aalto.fi/id/courses/noppa/dept_A805>'$value,use="pair",method = c("pearson"))

As a result we get 0.5835419 for the correlation between the monthly lecture counts of the Department of Art and the Department of Media Technology.
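
The same comparison can be extended from one pair to all departments at once. The following is only a sketch under stated assumptions: resList_demo is a made-up stand-in with the same shape as resList (one data frame with Date and value columns per department), and the keys and counts are invented, not taken from data.aalto.fi.

```r
# Invented stand-in for resList: three fictional departments, 12 months each.
resList_demo <- list(
  deptA = data.frame(Date = 1:12,
                     value = c(10, 20, 30, 25, 15, 0, 0, 0, 12, 22, 28, 9)),
  deptB = data.frame(Date = 1:12,
                     value = c(8, 18, 33, 20, 10, 2, 0, 0, 15, 25, 20, 5)),
  deptC = data.frame(Date = 1:12,
                     value = c(0, 5, 4, 6, 30, 28, 1, 0, 3, 2, 6, 7))
)

# Put the monthly counts side by side: one column per department.
values <- sapply(resList_demo, function(d) d$value)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(values, use = "pairwise.complete.obs", method = "pearson")
print(round(cor_matrix, 2))
```

With the real resList in place of resList_demo, the rows and columns of the matrix would be labelled with the department URIs.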

Here is the full code:

library(SPARQL)

endpoint <- "http://data.aalto.fi/sparql"
options <- "output=xml"

prefix <- c("teach","http://linkedscience.org/teach/ns#",
"geo","http://www.geonames.org/ontology#")
sparql_prefix <- "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX teach: <http://linkedscience.org/teach/ns#>
PREFIX aiiso: <http://purl.org/vocab/aiiso/schema#>
PREFIX ical: <http://www.w3.org/2002/12/cal/icaltzd#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/> "

q <- paste(sparql_prefix,
"SELECT DISTINCT ?department ?name ?org where {
?department aiiso:code ?code ; foaf:name ?name ; aiiso:teaches ?course ;
aiiso:part_of [ foaf:name ?org ] . ?course teach:arrangedAt [ical:dtstart ?start]
FILTER(lang(?name)='en' && lang(?org)='en')
} GROUP BY ?org ?department ?name ORDER BY desc(?org) ?name")

res <- SPARQL(endpoint,q,ns=prefix,extra=options)$results
res$department

resList <- list()

for(i in 1:length(res$department)) {

queryStart <- "SELECT (xsd:string(?month) as ?month) (count(?course) as ?lectures) WHERE { "
queryEnd <- " aiiso:teaches ?course . ?course teach:arrangedAt [ical:dtstart ?start] . } GROUP BY (substr(str(?start),6,2) as ?month) ORDER BY ?month"
q <- paste(sparql_prefix,queryStart,res$department[i],queryEnd)
response <- SPARQL(endpoint,q,ns=prefix,extra=options)$results

start = as.numeric("01")
full <- seq(start, by=1, length=12)
partialMedia <- data.frame(date=as.numeric(response[,1]),value=response[,2])
with(partialMedia, value[match(full, date)])
dMedia <- data.frame(Date=full, value=with(partialMedia, value[match(full, date)]))
dMedia[is.na(dMedia)] <- 0

resList[[res$department[i]]] = dMedia

plot(dMedia,type="b",xlab="month",main=res$name[i],ylab="number of lectures",pch=16,col="purple")
lines(dMedia,type="b",pch=16,col="red")
}

resList

## Type "resList" to get all of the values and resList["<URI>"] to get a single department
## For values type resList$'<URI>'$value, for Dates resList$'<URI>'$Date
## Correlation between two departments: Department of Media Technology and the Department of Art
cor(resList$'<http://data.aalto.fi/id/courses/noppa/dept_T3030>'$value,resList$'<http://data.aalto.fi/id/courses/noppa/dept_A805>'$value,use="pair",method = c("pearson"))
