The 5 Stars of Structured Data

Sir Tim Berners-Lee, the inventor of the World Wide Web, defined a 5-star deployment scheme for open data. In recent customer discussions, we’ve talked about a similar scheme to describe the status of data across their organisation and how text analytics can help contextualise unstructured data.

Structured and unstructured data

Using the table below, how much of the information in your company sits in each bucket?

★ Unstructured content locked in proprietary formats or systems
★★ As above but in open, accessible formats (PDF, Office)
★★★ Semi-structured data with basic definition (XML, Excel)
★★★★ Semantic, structured data where elements are described to some formal specification (RDF, ontologies)
★★★★★ Linked, interoperable semantic data

If you were to plot out how the data in your organisation falls into the 5 categories above:

  • How much would be 3★ or below?
  • Can you afford to ignore data that is locked up in those lower stars?
  • How much of your 3★ structured data is mapped to the appropriate ontologies needed to truly leverage that content?

A utopian vision would be to have all available data indexed, structured and linked together. Clearly we are not there yet, but it’s not as much of a pipe dream as you might think. Follow me on a short journey from 1★ to 5★ of data structure and learn how SciBite text analytics and semantic technologies can help transform your data.

★ and ★★ Silos

Let’s face it: large swathes of scientific content are, by their very nature, unstructured. Publications, conference records, news articles and even internal presentations all hold valuable scientific data, and it is often spread across multiple locations and formats. There simply isn’t the time or money available to manually process data that is generated in ever-increasing volume and at ever-increasing velocity. Technology needs to lend a hand.

★★★ Moving to contextualised data

Understanding the content and focus of each document speeds up the process of filtering through large volumes of information for the right data. Ambiguity and synonymy are commonplace in scientific text and complicate simple keyword extraction techniques. Controlled vocabularies and extensive ontologies help to group related terms, but who manages and updates the ones you need? Designed to understand the complexity of scientific text, SciBite’s Named Entity Recognition engine, TERMite, calls on a reference library of millions of scientific synonyms stored in multiple ontologies to transform documents of any type into semantically enriched, machine-readable data.
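To make the idea of synonym normalisation concrete, here is a minimal Python sketch of dictionary-based entity recognition. It is not TERMite or its API: the `VOCAB` table, the `annotate` function and the example terms are all hypothetical, and the sketch makes no attempt at disambiguation, which a real engine has to handle.

```python
import re

# Hypothetical mini-vocabulary: synonym -> (entity type, preferred label).
# A real ontology would hold far more entries plus stable identifiers.
VOCAB = {
    "acetylsalicylic acid": ("DRUG", "aspirin"),
    "aspirin": ("DRUG", "aspirin"),
    "heart attack": ("INDICATION", "myocardial infarction"),
    "myocardial infarction": ("INDICATION", "myocardial infarction"),
}

# Try longer synonyms first so multi-word terms win over shorter overlaps.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(s) for s in sorted(VOCAB, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def annotate(text):
    """Return one dict per vocabulary hit, normalised to its preferred label."""
    hits = []
    for match in _PATTERN.finditer(text):
        entity_type, preferred = VOCAB[match.group(1).lower()]
        hits.append(
            {
                "text": match.group(1),      # surface form as written
                "start": match.start(1),     # character offset in the input
                "type": entity_type,
                "preferred": preferred,      # normalised label
            }
        )
    return hits


if __name__ == "__main__":
    for hit in annotate("Acetylsalicylic acid reduced the risk of heart attack."):
        print(hit)
```

Two documents that use different synonyms for the same concept end up with the same `preferred` label, which is what makes them findable together later.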

★★★★ Subject-predicate-object

Once the entities have been identified and disambiguated, and multiple related synonyms have been normalised, TERMite can output results in multiple structured formats, including RDF, NoSQL, Graph (Neo4j), and many more. Building graph data from what was once plain text really starts to open up the exploratory potential of this data through text analytics, and it forms the foundation of many current data integration strategies.
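As a rough illustration of the subject-predicate-object shape, the Python sketch below serialises a few triples as N-Triples lines. The `http://example.org/` namespace, the predicate names and the helper functions are placeholders chosen for this example, not TERMite output or a recommended vocabulary.

```python
BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def iri(local_name):
    """Wrap a local name in the placeholder namespace as an N-Triples IRI."""
    return "<" + BASE + local_name.replace(" ", "_") + ">"


def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in triples)


if __name__ == "__main__":
    # Illustrative statements only: a document mentioning two normalised entities.
    example = [
        ("doc1", "mentions", "aspirin"),
        ("doc1", "mentions", "myocardial infarction"),
        ("aspirin", "co_occurs_with", "myocardial infarction"),
    ]
    print(to_ntriples(example))
```

Each line is a single statement, so the same output can be loaded into a triple store or translated into nodes and edges for a graph database.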

★★★★★ A single view of many slices

The final stage in our 5★ of data structure is linking together multiple sources of information. Here, we have linked data from PubMed, ClinicalTrials.gov and Orphanet, all in the same database and ready for analysis.
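One simple way to picture this linking step is an index keyed on the normalised entity, as in the Python sketch below. The source names, record identifiers and entity sets are invented for illustration and do not correspond to real records in any of the databases mentioned.

```python
from collections import defaultdict

# Placeholder records; identifiers and contents are invented for illustration.
SOURCES = {
    "publications": [
        {"id": "pub-001", "entities": {"aspirin", "myocardial infarction"}},
    ],
    "trials": [
        {"id": "trial-001", "entities": {"aspirin"}},
    ],
    "rare_diseases": [
        {"id": "disease-001", "entities": {"cystic fibrosis"}},
    ],
}


def link_by_entity(sources):
    """Map each normalised entity to the (source, record id) pairs mentioning it."""
    index = defaultdict(list)
    for source_name, records in sources.items():
        for record in records:
            for entity in sorted(record["entities"]):
                index[entity].append((source_name, record["id"]))
    return dict(index)


if __name__ == "__main__":
    for entity, mentions in link_by_entity(SOURCES).items():
        print(entity, "->", mentions)
```

Because every source has been annotated against the same preferred labels, an entity such as "aspirin" becomes the join key between a publication and a trial without any source-specific mapping code.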

SciBite’s technologies are designed to plug into many of the current major systems for 4★ and 5★ data, including Neo4j, OpenLink Virtuoso, the Cambridge Semantics Anzo platform, Spotfire, Linkurious, and many more.

Regardless of the source data, or end applications used, the results should be linked in a manner that lets the science speak for itself.
