Amazon – health data

This tutorial reads the pixel based deforestation data for the Amazon region and computes the area weighted deforestation per municipality. This data is linked with data provided by IBGE reporting the number of deaths for a set of given reasons for several years. To avoid distortions of these numbers due to the different population sizes, the population weighted average is calculated and used to compute correlation values.

The data has been composed as linked data by Jim Jones (jim.jones@uni-muenster.de), the R-script is provided by Benedikt Gräler (ben.graeler@uni-muenster.de).

Load necessary libraries:

library(SPARQL) # version > 1.9
library(spacetime)
library(lattice)

The data is available online at the following endpoint

# setting the endpoint
endpoint <- "http://spatial.linkedscience.org/sparql"

The deforestation data is provided pixel based as DEFOR_YEAR. Additionally, the area overlap between pixels and municipalities is provided as triples. This allows to compute the area deforested per municipality within the SPARQL-query as follows:

# defining a first query
q <- "PREFIX amazon: <http://spatial.linkedscience.org/context/amazon/>
      PREFIX tisc: <http://observedchange.com/tisc/ns#>
      PREFIX owl: <http://www.w3.org/2002/07/owl#>
      SELECT DISTINCT ?muni ?aka 
        (SUM (?overlapArea) AS ?area)
        (SUM (?defor2002 * ?overlapArea) AS ?def02)
        (SUM (?defor2003 * ?overlapArea) AS ?def03)
        (SUM (?defor2004 * ?overlapArea) AS ?def04)
        (SUM (?defor2005 * ?overlapArea) AS ?def05)
        (SUM (?defor2006 * ?overlapArea) AS ?def06)
        (SUM (?defor2007 * ?overlapArea) AS ?def07)
        (SUM (?defor2008 * ?overlapArea) AS ?def08)
      WHERE {
        ?cell amazon:DEFOR_2002 ?defor2002 .
        ?cell amazon:DEFOR_2003 ?defor2003 .
        ?cell amazon:DEFOR_2004 ?defor2004 .
        ?cell amazon:DEFOR_2005 ?defor2005 .
        ?cell amazon:DEFOR_2006 ?defor2006 .
        ?cell amazon:DEFOR_2007 ?defor2007 .
        ?cell amazon:DEFOR_2008 ?defor2008 .
        ?overlapObject tisc:partialOverlapFrom ?cell .
        ?overlapObject tisc:partialOverlapTo ?muni .
        ?overlapObject tisc:partialOverlapArea ?overlapArea .
        ?aka owl:sameAs ?muni .
      } GROUP BY ?muni ?aka"
ns <- c("amazon","http://spatial.linkedscience.org/context/amazon/")
muniDefor <- SPARQL(endpoint, q, ns=ns, extra=list(output="csv"), format="csv")$result

Note, that the variable “?aka” is necessary as the  municipalities’ identifiers change between a six and seven digit code depending on the data source. The relation is modeled as an “sameAs”. The structure of the retrieved data can be investigated with:

str(muniDefor)

returning

'data.frame': 887 obs. of 10 variables:
$ muni : Factor w/ 887 levels "amazon:BRAZIL_MUNICIPALITY_1100015",..: 268 416 725 783 410 257 824 433 760 755 ...
$ aka : Factor w/ 887 levels "amazon:BRAZIL_MUNICIPALITY_110001",..: 268 416 725 783 410 257 824 433 760 755 ...
$ area : num 1397 774 7487 1339 10036 ...
$ def02: num 21.32 1.02 1.91 18.51 3.28 ...
$ def03: num 27.393 1.6925 0.0966 11.2062 2.2746 ...
$ def04: num 31.2351 1.8865 0.0306 6.1923 0.6461 ...
$ def05: num 24.703 0.136 1.03 10.591 8.371 ...
$ def06: num 1.8204 0.0909 0.0323 2.729 0.1924 ...
$ def07: num 10.895 1.446 0.457 0.775 0.983 ...
$ def08: num 16.864 0.518 1.282 1.706 0.904 ...

As we are interested in the geometries of the municipalities as well, we retrieve their polygons’ coordinates through:

# query the municipalties' geometries
muniDefor$geometry <- as.character(1:nrow(muniDefor))
for(i in 1:nrow(muniDefor)) {
  q_muni <- paste("PREFIX amazon: <http://spatial.linkedscience.org/context/amazon/>
                   PREFIX tisc: <http://observedchange.com/tisc/ns#>
                   SELECT ?geom where { ",muniDefor$muni[i]," tisc:geometry ?geom }",sep="")
  muniDefor$geometry[i] <- SPARQL(endpoint, q_muni, extra=list(output="csv"), format="csv")$result
}

The data is returned as a factor, for further use we coerce the coordinates to characters and construct polygons from them.

muniDefor$geometry <- lapply(muniDefor$geometry, as.character) 

# construct polygons
allPolys <- NULL
for ( i in 1:nrow(muniDefor)) {
  if(i%%100 == 0) cat(i,"\n") # show progress
  tmpStr <- muniDefor$geometry[[i]]
  tmpPoly <- NULL
  for(j in 1:length(tmpStr)) {
    tmpPoly <- append(tmpPoly, Polygon(matrix(unlist(lapply(strsplit(strsplit(tmpStr[j],";")[[1]],","),as.numeric)), ncol=2,byrow=T)))
  }
  allPolys <- append(allPolys, Polygons(tmpPoly, ID=i))
}

# create SpatialPolygons object
spPoly <- SpatialPolygons(allPolys, proj4string=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"))

To be able to treat the data easily, we construct a Spatio-Temporal Full Data Frame (STFDF) from the package “spacetime”, add the municipalities names and the deforestation values for the provided years:

# creating the STFDF
amazon <- STFDF(sp=spPoly,time=as.POSIXct(c("2002-01-01","2003-01-01","2004-01-01",
                                            "2005-01-01","2006-01-01","2007-01-01",
                                            "2008-01-01")),
                data=data.frame(defor=numeric(length(spPoly)*7)))

# adding municipality names, aka, and area to sp slot
amazon@sp <- SpatialPolygonsDataFrame(amazon@sp, muniDefor[1:3])

str(amazon@sp@data)

returning

'data.frame': 887 obs. of 3 variables:
$ muni: Factor w/ 887 levels "amazon:BRAZIL_MUNICIPALITY_1100015",..: 268 416 725 783 410 257 824 433 760 755 ...
$ aka : Factor w/ 887 levels "amazon:BRAZIL_MUNICIPALITY_110001",..: 268 416 725 783 410 257 824 433 760 755 ...
$ area: num 1397 774 7487 1339 10036 ...

The deforestation data is added to the STFDF’s data slot

# adding deforestation values for 7 years
for(i in 1:7) {
  amazon@data$defor[((i-1)*length(amazon@sp)+1):(i*length(amazon@sp))] <- muniDefor[[i+3]]
}

str(amazon@data)

returning

'data.frame': 6209 obs. of 1 variable:
$ defor: num 21.32 1.02 1.91 18.51 3.28 ...

A plot of the STFDF can be produced by (takes some time due to the large number of detailed polygons)

stplot(amazon[,2:3])

In the following, the addtional data on deaths per reason, municipality and year is retrieved:

# retrieving additional data by municipality
deathData <- NULL 
populData <- NULL
for (aka in amazon@sp$aka) {
  tmp <- paste("PREFIX amazon: <http://spatial.linkedscience.org/context/amazon/>
                PREFIX dbpedia: <http://www.dbpedia.org/resource/>                  PREFIX owl: <http://www.w3.org/2002/07/owl#>
                SELECT ?deaths ?reason ?year ?popul where { \n",
                aka, " amazon:hasObservation ?obs .
                ?obs amazon:observationType dbpedia:Death .
                ?obs amazon:amountProduced ?deaths .
                ?obs amazon:isAbout ?reason .
                ?obs amazon:year ?year . \n",
                aka, " amazon:hasObservation ?obsPop .
                ?obsPop amazon:observationType dbpedia:Population .
                ?obsPop amazon:year ?year.
                ?obsPop amazon:population ?popul . }", sep="")
  tmpRes <- SPARQL(endpoint, tmp, ns=c("dbpedia", "http://www.dbpedia.org/resource/"), extra=list(output="csv"), format="csv")$result
  tmpDRes <- cbind(rep(aka,nrow(tmpRes)),tmpRes[,1:3])
  deathData <- rbind(deathData, tmpDRes)
  tmpPRes <- cbind(rep(aka,5),tmpRes[(1:5)*30,c(4,3)])
  populData <- rbind(populData, tmpPRes)
}
colnames(deathData) <- c("aka","deaths","reason","year")
colnames(populData) <- c("aka","popul","year")
str(deathData)

To combine all the data, we add the newly retrieved death and population data to the STFDF:

 # deaths
for (reason in levels(deathData$reason)) {
  res <- rep(NA,nrow(amazon@data))
  uniqueYears <- unique(deathData$year)
  uniqueYears <- uniqueYears[which(uniqueYears %in% c("2002","2003","2004","2005","2006","2007","2008") )]
  for(year in uniqueYears) {
    sel <- deathData$year==year & deathData$reason==reason
    ids <- match(amazon@sp$aka, deathData[sel,]$aka)
    tId <- which(c("2002","2003","2004","2005","2006","2007","2008") %in% year)
    res[((tId-1)*length(amazon@sp)+1):(tId*length(amazon@sp))] <- deathData[sel,][ids,]$deaths
  }
  amazon@data[[reason]] <- res
  rm(res)
}

# droping the prefix
colnames(amazon@data)[-1] <- lapply(colnames(amazon@data)[-1], function(x) substr(x,9,nchar(x)))

# adding the population data
populLongTab <- NULL
for(year in 2002:2008) {
  sel <- populData$year == year
  if(sum(sel,na.rm=T)==0)
    res <- rep(NA,length(amazon@sp))
  else
    res <- populData[sel,2]
  populLongTab <- c(populLongTab,res)
}
amazon@data[["Population"]] <- populLongTab
rm(populLongTab)
summary(amazon@data)

The entire structure of the STFDF can be inspected by

str(amazon@data)

As the data is only complete for a subset of years, we temporally subset the STFDF and use only years 2005, 2006 and 2007:

subAmazon <- amazon[,4:6]
str(subAmazon@data)

returning

'data.frame':	2661 obs. of  32 variables:
 $ defor                       : num  24.703 0.136 1.03 10.591 8.371 ...
 $ Blood_diseases              : int  1 0 0 0 0 0 0 0 0 0 ...
 $ Childbirth                  : int  1 0 0 0 0 1 0 0 0 0 ...
 $ Chromosome_abnormality      : int  1 0 0 0 0 0 0 0 0 1 ...
 $ Circulatory_system          : int  2 1 3 4 0 3 1 1 0 17 ...
 $ Congenital_disorder         : int  1 0 0 0 0 0 0 0 0 1 ...
 $ Connective_tissue           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Deformities                 : int  1 0 0 0 0 0 0 0 0 1 ...
 $ Endocrine_system            : int  0 0 0 1 0 1 0 0 0 3 ...
 $ Exame_laboratorial          : int  1 0 0 0 0 0 0 0 0 1 ...
 $ External_cause              : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Eye                         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Genetic_disorder            : int  1 0 0 0 0 0 0 0 0 1 ...
 $ Genitourinary_system        : int  2 0 1 2 0 0 0 0 0 7 ...
 $ Human_gastrointestinal_tract: int  1 0 4 0 0 3 1 0 0 4 ...
 $ Human_musculoskeletal_system: int  0 0 0 0 0 0 0 0 0 0 ...
 $ Infectious_disease          : int  3 0 2 1 1 3 0 1 0 15 ...
 $ Lesion                      : int  0 0 1 1 0 1 2 0 0 2 ...
 $ Mastoiditis                 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Medical_test                : int  1 0 0 0 0 0 0 0 0 1 ...
 $ Mental_disorder             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Neoplasm                    : int  2 1 2 0 0 1 1 0 0 2 ...
 $ Nervous_system              : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Nosocomial_infection        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Perinatalperiode            : int  5 1 2 1 0 2 1 0 0 5 ...
 $ Poison                      : int  0 0 1 1 0 1 2 0 0 2 ...
 $ Pregnancy                   : int  1 0 0 0 0 1 0 0 0 0 ...
 $ Puerpera                    : int  1 0 0 0 0 1 0 0 0 0 ...
 $ Respiratory_system          : int  1 0 7 1 0 1 0 0 0 11 ...
 $ Skin                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Subcutaneous_tissue         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Population                  : int  23587 3878 14849 3704 4325 13018 1792 9441 12696 26938 ...

Now, we can investigate relationships between the collected data by Kendall’s rank correlation coefficient tau:

# correlation of deforestation rate and deaths/population
corLine <- cor(subAmazon@data[,1]/subAmazon@sp$area, cbind(subAmazon@data[,2:31]/subAmazon@data[,32],subAmazon@data[,32,drop=F]),
               use="pairwise", method="kendall")

The following plot gives insight in the dependence structure:

barchart(corLine[,], origin=0,col="darkgrey", main="Kendall's tau for defor. rates and rel. deaths in Amazonian Municipalities for 2005, 2006 and 2007",
         xlab="Kendall's tau coorelation coefficient")

colnames(corLine)[which.max(corLine[-31])]

The largest Kendall’s tau values are found for “Population” and death by “Infectious_disease”. Both coefficents are found to be significant based on the following tests:

# test correlation for population
cor.test(subAmazon@data[,1]/subAmazon@sp$area, subAmazon@data[,32], method="kendall",use="pairwise.complete") # p< 0.01

# test correlation for Infectious_disease
cor.test(subAmazon@data[,1]/subAmazon@sp$area, subAmazon@data[,17]/subAmazon@data[,32], method="kendall",use="pairwise.complete") # p< 0.01

To further investigate and plot the data, we add the relative variables and plot the scatter:

# adding relative variables
subAmazon@data[["rel_Infect_dis"]] <- subAmazon@data[["Infectious_disease"]]/subAmazon@data[,32]
subAmazon@data[["Infect_dis_per_25"]] <- subAmazon@data[["Infectious_disease"]]/subAmazon@data[,32]*25
subAmazon@data[["rel_defor"]] <- subAmazon@data[["defor"]]/subAmazon@sp$area 

# look at the scatter
plot(cbind(subAmazon@data[,"rel_defor",drop=F],subAmazon@data[,"rel_Infect_dis"]))

Maps can be plotted by

stplot(subAmazon[,1:2,"rel_Infect_dis"],col.regions=bpy.colors(), main="relative number of deaths by \"infectious disease\"")

spplot(subAmazon[,2],c("Infect_dis_per_25","rel_defor"),col.regions=bpy.colors(), main="deaths/25 caused by \"infectious disease\" vs deforestation rate in 2006")

Leave a Reply