This “POPS” project used semantic web technologies (RDF and SPARQL). It’s from NASA/JPL, but no, you don’t have to be a rocket scientist …

I liked the criteria:

  • The application aggregates data from three independently developed sources.
  • The data is used in ways not originally intended (“serendipitous reuse”).
  • The cost of aggregation is low, requiring only a small amount of connective tissue.

And I liked the service agreements:

  • We documented the fields we were using and the owner committed to notify us if they changed their data structure.
  • We documented how we intended to use the data, which made people much more comfortable.

I also liked a couple of other aspects. Take a look. Since this is about finding expertise, it would translate very nicely into a medical environment.
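
To make the “small amount of connective tissue” criterion concrete, here’s a minimal sketch in Python with rdflib. It isn’t the POPS code, and every URI, predicate, and fact in it is invented; the pattern is simply three independently published sources parsed into one graph and asked a single SPARQL question about who has what expertise.

```python
# Illustrative only -- not POPS. Three made-up sources about the same person,
# merged and queried with one small piece of "connective tissue."
from rdflib import Graph

hr_source = """
@prefix staff: <http://example.org/staff/> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
staff:jsmith foaf:name "J. Smith" .
"""
project_source = """
@prefix staff: <http://example.org/staff/> .
@prefix proj:  <http://example.org/projects/> .
staff:jsmith proj:workedOn proj:sample-return-study .
"""
skills_source = """
@prefix staff:  <http://example.org/staff/> .
@prefix skills: <http://example.org/skills/> .
staff:jsmith skills:expertIn "trajectory analysis" .
"""

# The connective tissue: load all three sources into one graph...
g = Graph()
for source in (hr_source, project_source, skills_source):
    g.parse(data=source, format="turtle")

# ...and ask one question none of the sources could answer alone.
expertise_query = """
PREFIX foaf:   <http://xmlns.com/foaf/0.1/>
PREFIX proj:   <http://example.org/projects/>
PREFIX skills: <http://example.org/skills/>
SELECT ?name ?project ?skill WHERE {
    ?person foaf:name ?name ;
            proj:workedOn ?project ;
            skills:expertIn ?skill .
}
"""
for name, project, skill in g.query(expertise_query):
    print(name, project, skill)
```

None of the sources had to be redesigned to make this work; each keeps publishing what it already publishes, and the aggregation is a few lines of loading plus one query.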


Quote of the Day

by John W Rodat on June 3, 2011

“To see what is in front of one’s nose needs a constant struggle” — George Orwell


Delayed reaction here, but they did a nice job so I’ll post anyway.

The New York Times received lots of reactions to the killing of Osama Bin Laden, and here’s how they displayed them.

Note that all the structured data are quite visible, so the viewer can see both concentrations and variability. The unstructured (text) data is available by hovering over individual data points. The latter has no summary, but at least it hasn’t been thrown away.

The work was done by Jon Huang and Aron Pilhofer. Kudos.


Carole Goble’s emphasis is on the life sciences. But the lessons in Democratizing Informatics for the ‘Long Tail’ Scientist are both generic and important. Substitute your own domain.

Semantic Web is the elephant to a blind man—everything to everybody.

One of the prime pieces of the Semantic Web is the notion you could publish into a common model, facts and information available only as a person-readable means. You’d be able to crosslink and query across these different information resources so machines could manipulate and find information and connect people in new ways. That’s what linked open data is: the original idea was, here’s a Web page, I’ll annotate it, and those facts will be pooled into this knowledge web or information space, which will be used as a platform for doing all sorts of applications. What was underplayed too much in the emerging Semantic Web was that you could do this with datasets—forget web pages, just push datasets out in a common data model, let’s call that ‘RDF’ [the Resource Description Framework]. Then if we had some common identifiers and common vocabularies, we could begin to build bridges between datasets. Imagine the London Underground map—that was the vision, right? But somewhere along the line it got migrated into automated reasoning and AI and inference and rich ontology models, which are great, but miss the point of basically indexing and linking.

Linked open data is a return to publishing, indexing, and linking. This is largely what you want to do—I want to find a connection between two datasets, I want to do some aggregation for multiple datasets around a particular protein or assay or scientist. For this, you have to focus on identity and adoption of common vocabularies. The adoption is more important than the complexity of the vocabularies, because you’re relying on the ubiquity of the terms. It’s just enough, just in time, not just in case.

When we return to that, you see that makes some sense, a return to ideas of the 1990s: how do we do data integration using some descriptions? So now we’re on track, but there are a few fundamental issues we have to sort out:

1) We have to sort out the infrastructure. We need scalable tools to handle large amounts of data. We’ve built trillions of facts that are readily integratable. I don’t ever want to see an RDF triple—I want to see a genome browser that happens to be powered by RDF triples.

2) We need ubiquity in publishing using some identifiers and some common concepts so we can do some linking. The emphasis is on adoption.

3) We need a primary mechanism of dealing with provenance—where did this thing come from? Amazingly, that wasn’t considered a prime piece of the Semantic Web originally. There is no infrastructure for provenance and versioning. That’s pretty damn important. We absolutely have to sort that out.

Building bridges between datasets …
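
Here’s a rough sketch, in Python with rdflib, of what such a bridge can look like mechanically. None of it comes from Goble or from any real dataset: the two publishers, the vocabulary, and the facts are invented, and the UniProt-style URI is only there to stand in for “a common identifier.” Putting each publisher’s facts in their own named graph is also just one simple way to keep the provenance she’s asking for, not a prescription.

```python
# A rough sketch only: two invented publishers, one shared identifier,
# a deliberately small shared vocabulary, and named graphs so each fact
# keeps a pointer back to where it came from.
from rdflib import Dataset, URIRef

ds = Dataset()

# Each publisher's facts live in their own named graph (the provenance).
publishers = {
    "http://example.org/source/lab-a": """
        @prefix up:  <http://purl.uniprot.org/uniprot/> .
        @prefix voc: <http://example.org/vocab/> .
        up:P12345 voc:assay voc:assay-17 ;
                  voc:activity 0.82 .
    """,
    "http://example.org/source/lab-b": """
        @prefix up:  <http://purl.uniprot.org/uniprot/> .
        @prefix voc: <http://example.org/vocab/> .
        up:P12345 voc:organism "H. sapiens" ;
                  voc:studiedBy "A. Researcher" .
    """,
}
for source, turtle in publishers.items():
    ds.graph(URIRef(source)).parse(data=turtle, format="turtle")

# The "bridge" is nothing more than the shared identifier up:P12345 and the
# shared voc: terms. One query aggregates across both publishers and also
# answers "where did this fact come from?" via the graph name.
results = ds.query("""
    PREFIX up: <http://purl.uniprot.org/uniprot/>
    SELECT ?source ?property ?value WHERE {
        GRAPH ?source { up:P12345 ?property ?value . }
    }
""")
for source, prop, value in results:
    print(source, prop, value)
```

Nothing clever is happening here: no inference, no rich ontology, just a shared name, a few shared terms, and a graph label that remembers the source.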


The Osborne Portable PC strained my eyesight but opened up a whole new world for me. When I figured out how to make it do something useful (with insurance data), I figured – no, KNEW – the world was going to change. After all, if I could do it, anyone could.

Years later, I gave my Osborne to the New York State Museum. They were happy to have it.


Use ALL the Damn Data

by John W Rodat on March 30, 2011

Follow the trail here. It’s only barely polite, if that. But the issues are important: what our economic policy should be, and honesty in data analysis.

John Taylor is a professor of economics at Stanford. In mid-January, he posted an analysis, Higher Investment Best Way to Reduce Unemployment, Recent Experience Shows, that depicted a very strong relationship between unemployment and investment. The higher the investment as a percentage of GDP, the lower the unemployment. Or something like that.

Some economists argue that the efforts now underway to reduce government spending as a share of GDP will have adverse effects on unemployment. This is not what the data show. Consider this chart which shows the pattern of government purchases as a share of GDP and the unemployment rate over the past two decades. (The data are quarterly seasonally adjusted from 1990Q1 to 2010Q3.) There is no indication that lower government purchases increase unemployment; in fact we see the opposite, and a time-series regression analysis to detect timing shows that the correlation is not due to any reverse causation from high unemployment to more government purchases.

In sharp contrast, the data on spending shares show that the most effective way to reduce unemployment is to raise investment as a share of GDP. The second chart shows the relation between unemployment and fixed investment over the past two decades. Higher shares of investment are associated with lower unemployment.

A couple of days ago, Greg Mankiw wrote A Striking Scatterplot and noted that causality could be reversed, but was nevertheless impressed.

Of course, causality goes in both directions: Strong investment demand leads to lower unemployment, and a stronger economy, reflected in lower unemployment, encourages investment spending. As a result, the interpretation of this scatterplot can be debated. But there is no doubt that the strength of the correlation is impressive.

But earlier today, Krugman pulled the analysis apart, saying,

It’s mostly the housing bust! Yes, business investment is low — but no lower than you might expect given the depressed state of the economy. In fact, business investment is roughly the same percentage of GDP now that it was at the same stage of the much milder 2001 recession.

What the data actually say is that we had a catastrophic housing bust and consumer pullback, and that businesses have, predictably, cut back on investment in the face of excess capacity. The rest is just politically motivated mythology.

And then Justin Wolfers eviscerated Taylor by pointing out that he didn’t use all the data and that earlier data showed a very different pattern.

Sometimes you see the perfect piece of evidence. The scatter plot that is just so. The data line up perfectly. And then you realize, perhaps they’re just too perfect. What you are seeing is advocacy, dressed up as science. Here’s an example, provided by John Taylor.

And then Brad DeLong summed it all up, Krugman here and Wolfers here, complete with the key graphics.

Lessons?

  • Now I don’t have either the credentials or the “cred” of any of these guys. But what seemed obvious to me to be missing was some notion of the time it takes for a cause to have an effect. There’s just no way that the effect could be effectively instantaneous, occurring in the same year. So one factor or the other should have been lagged (there’s a minimal sketch of what I mean just after this list).
  • If you’re not going to use all the data, at least say so and attempt some sort of explanation. (Hmmm. I may be guilty of that failure.)
  • When you post to the web on a controversial and important matter and you have an audience, somebody will attempt to challenge your analysis. Sometimes, it’s just a legitimately different interpretation of the data. Heaven help you though if you’ve distorted or omitted key data.
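
On the lag point, here’s a minimal sketch in Python with pandas of what I mean. The two quarterly series are synthetic stand-ins I made up, not the actual investment and unemployment figures; the only point is to check the correlation at several lags instead of only contemporaneously.

```python
# Sketch of the "lag one factor" point. The series below are invented
# quarterly stand-ins for investment share and unemployment; the method,
# not the numbers, is what matters.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
quarters = pd.period_range("1990Q1", "2010Q3", freq="Q")

# Hypothetical data: investment share moves first, unemployment responds
# several quarters later, plus noise.
investment_share = pd.Series(
    15 + np.sin(np.linspace(0, 6, len(quarters)))
    + rng.normal(0, 0.2, len(quarters)),
    index=quarters,
)
unemployment = 9 - 0.5 * investment_share.shift(4) + rng.normal(0, 0.2, len(quarters))

# Contemporaneous correlation versus correlations at various lags.
for lag in range(0, 9):
    corr = investment_share.shift(lag).corr(unemployment)
    print(f"lag {lag} quarters: correlation = {corr:.2f}")
```

With real data you’d also want to test the reverse direction, which is Mankiw’s point about causality running both ways.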

Don’t just use the damn data. Use all the damn data.


Quote of the Day

by John W Rodat on March 24, 2011

Sarah Hartley of the Guardian posted a couple of days ago in Data expert moves on from ‘telephone journalism’ and quoted Francis Irving saying:

In the end we’ll no more talk about data journalism than we talk now about telephone journalism.

Irving has a successful track record of using the web to provide the public with information on public issues and decisions, such as Parliamentary voting records.

Hartley also posts information on a Data Journalism Camp, sponsored by the Digital Editors Network, to be held in Manchester in May 2011.


Even a cursory view suggests to me that the Guardian has been making extra efforts at using data, and visualized data at that, in their reporting. You’ll find them particularly at their Datablog.

Here’s their story on five key data sets in the latest budget. One graphic in particular stood out for me, and that was the one on public sector debt. I’m going to look at that more closely.

They use a couple of different tools, including Timetric and IBM’s Manyeyes.

They also regularly use some good practices. The graphics are interactive and the data are downloadable.

There are stories in data. Just like reading, it takes literacy to understand the stories in data and in visualized data. Literacy takes learning, and learning takes time.

There are stories in data. Just like writing, it takes literacy to understand how to tell stories that are embedded in data. That, as well, takes learning and time.


County Spending Animated

by John W Rodat on March 22, 2011

Per capita expenditures animated by county (excluding the five counties of New York City). (Flash required.) NYS County Expend per Capita 1998-2009.swf. Experiment with the scales, speed, and labeling.
