Semantic Web: Tools you can use

23.03.2011

Vince Fioramonti had an epiphany back in 2001. He realized that valuable investment information was becoming increasingly available on the Web, and that a growing number of vendors were offering software to capture and interpret that information in terms of its importance and relevance.

"I already had a team of analysts reading and trying to digest financial news on companies," says Fioramonti, a partner and senior international portfolio analyst at Hartford, Conn.-based investment firm Alpha Equity Management. But the process was too slow and results tended to be subjective and inconsistent.

The following year, Fioramonti licensed Autonomy's Intelligent Data Operating Layer (IDOL) to process various forms of digital information automatically. Deployment ran into a snag, however: IDOL provided only general semantic algorithms. Alpha Equity would have had to assign a team of programmers and financial analysts to develop finance-specific algorithms and metadata, Fioramonti says. Management scrapped the project because it was too expensive.

(For more information about semantic technologies, including search, see Part 1 of this story.)

The breakthrough for Alpha Equity came in 2008, when the firm signed up for Thomson Reuters' Machine Readable News service. The service collects and analyzes online news from 3,000 Reuters reporters, and from third-party sources such as online newspapers and blogs. It then analyzes and scores the material for sentiment (how the public feels about a company or product), relevance and novelty.

The results are streamed to customers, who include public relations and marketing professionals, stock traders performing automated black box trading and portfolio managers who aggregate and incorporate such data into longer-term investment decisions.

A monthly subscription to the service isn't cheap, Fioramonti says. According to one estimate -- which Thomson Reuters would not comment on -- the cost of real-time data updates is between $15,000 and $50,000 per month. But Fioramonti says the service's value more than justifies the price Alpha Equity pays for it. He says the information has helped boost the performance of the firm's portfolio and it has enabled Alpha Equity to get a jump on competitors. "Thomson Reuters gives us the news and the analysis, so we can continue to grow as a quantitative practitioner," he says.

Alpha Equity's experience is hardly unique. Whether a business decides to build in-house or hire a service provider, it often pays a hefty price to fully exploit semantic technology. This is particularly true if the information being searched and analyzed contains jargon, concepts and acronyms that are specific to a particular business domain.

Here's an overview of what's available to help businesses deploy and exploit semantic Web infrastructures, along with a look at what's still needed for the technology to achieve its potential.

At the core of a semantic Web is federated search. This would enable a search engine, automated agent or application to query hundreds or thousands of information sources on the Web, discover and semantically analyze relevant content, and retrieve exactly the product, answer or information the user was seeking.

Although federated search is catching on, it's a long way from a Webwide phenomenon.

To help federated search gain traction, the World Wide Web Consortium (W3C) has developed standards that define a basic semantic infrastructure. They include the following:

• SPARQL, which defines a standard language for querying and accessing data.

• Resource Description Framework (RDF) and RDF Schema (RDFS), which describe how information is represented and structured in a semantic ontology (also called a vocabulary).

• Web Ontology Language (OWL), which provides a richer description of the ontology and also includes some RDFS elements.
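
As a rough illustration of what these standards describe, RDF represents information as subject-predicate-object statements, and a query language such as SPARQL matches patterns against them. The Python sketch below imitates that idea with plain tuples. It is not an RDF or SPARQL implementation, and every identifier in it (the `ex:` names, `match`, `instances_of`) is hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sample statements; none of these names come from a real vocabulary.
TRIPLES: List[Triple] = [
    ("ex:tv1", "rdf:type", "ex:Television"),
    ("ex:tv1", "ex:screenSizeInches", "46"),
    ("ex:tv2", "rdf:type", "ex:Television"),
    ("ex:Television", "rdfs:subClassOf", "ex:Product"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


def instances_of(triples: List[Triple], cls: str) -> List[str]:
    """Subjects typed as cls or as a direct subclass of cls.

    This is a single, simplified schema-style inference step, not full RDFS reasoning.
    """
    classes = {cls} | {sub for sub, _, _ in match(triples, p="rdfs:subClassOf", o=cls)}
    return sorted(subj for subj, _, obj in match(triples, p="rdf:type") if obj in classes)
```

With the sample data, `instances_of(TRIPLES, "ex:Product")` finds both televisions even though neither is directly labeled a product, which is the kind of relationship a schema or ontology layer is meant to capture.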

The final versions of these standards are supported by leading semantic Web platform vendors such as Cambridge Semantics, Expert System, Revelytix, Endeca, Lexalytics, Autonomy and Topquadrant.

Major Web search engines, including Bing, have begun using semantic technology to prioritize searches and to support W3C standards like RDF.

And enterprise software vendors like Oracle, SAS Institute and IBM are jumping on board, too.

Semantic software uses a variety of techniques to analyze and describe the meaning of data objects and their inter-relationships. These include a dictionary of generic and, often, industry-specific definitions of terms, as well as analysis of grammar and context to resolve language ambiguities such as words with multiple meanings.

The purpose of resolving language ambiguities is to help ensure, for example, that a shopper who does a search using a phrase like "used red cars" will also get results from Web sites that use slightly different terms with similar meanings, such as "pre-owned" instead of "used" and "automobile" instead of "car."
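
The dictionary side of that process can be sketched in a few lines of Python. The example below is a deliberately minimal assumption of how term lookup might work: the synonym table and the function names are hypothetical, and it skips the grammar and context analysis that real semantic software relies on.

```python
import re
from typing import FrozenSet

# Hypothetical synonym dictionary mapping variant terms to one canonical term.
CANONICAL = {
    "pre-owned": "used",
    "preowned": "used",
    "automobile": "car",
    "automobiles": "car",
    "cars": "car",
}

# Words, optionally hyphenated, so "pre-owned" stays a single token.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def normalize(text: str) -> FrozenSet[str]:
    """Lowercase the text, split it into tokens and map each to its canonical term."""
    return frozenset(CANONICAL.get(tok, tok) for tok in TOKEN.findall(text.lower()))


def matches(query: str, listing: str) -> bool:
    """True when every normalized query term also appears in the listing."""
    return normalize(query) <= normalize(listing)
```

Under these assumptions, `matches("used red cars", "Pre-owned red automobile, low mileage")` is true, while the same query does not match "New red car" because nothing in that listing maps to "used".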

W3C standards are designed to resolve inconsistencies in the way various organizations organize, describe, present and structure information, and thereby pave the way for cross-domain semantic querying and federated search.

To illustrate the advantage of using such standards, Michael Lang, CEO of Revelytix, a Sparks, Md.-based maker of ontology-management tools, offers the following scenario: If 200 online consumer electronics retailers used semantic Web standards such as RDF to develop ontologies that describe their product catalogs, Revelytix's software could make that information accessible via a SPARQL query point. Then, says Lang, online shoppers could use W3C-compliant tools to search for products across those sites, using queries such as: "Show all flat-screen TVs that are 42-52 inches, and rank the results by price."
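
Lang's scenario can be pictured with a small Python sketch. It assumes each retailer exposes its catalog through some query function (a stand-in for a SPARQL endpoint); the `Offer` fields, the two sample shops and `federated_search` are all hypothetical and only illustrate the filter-and-rank step across sources.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Offer:
    retailer: str
    product: str
    screen_inches: float
    price_usd: float


# Stand-ins for per-retailer endpoints; a real deployment would query each site over the network.
def shop_a() -> List[Offer]:
    return [Offer("shop-a", "TV A1", 40, 399.0), Offer("shop-a", "TV A2", 46, 649.0)]


def shop_b() -> List[Offer]:
    return [Offer("shop-b", "TV B1", 52, 599.0), Offer("shop-b", "TV B2", 60, 999.0)]


def federated_search(
    endpoints: Iterable[Callable[[], List[Offer]]],
    min_inches: float,
    max_inches: float,
) -> List[Offer]:
    """Query every endpoint, keep offers in the size range, and rank them by price."""
    results: List[Offer] = []
    for query_endpoint in endpoints:
        results.extend(
            offer
            for offer in query_endpoint()
            if min_inches <= offer.screen_inches <= max_inches
        )
    return sorted(results, key=lambda offer: offer.price_usd)
```

Calling `federated_search([shop_a, shop_b], 42, 52)` returns the 52-inch set from the second shop ahead of the 46-inch set from the first, because results are ranked by price rather than by source. The sketch only works because both shops describe products with the same fields, which is the role shared standards would play across real sites.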

Search engines and some third-party Web shopping sites offer product comparisons, but those comparisons tend to be limited in terms of the range of attributes covered by a given search. Moreover, shoppers will often find that the data provided by third-party shopping sources is out of date or otherwise incorrect or misleading -- it may not, for example, have accurate information about the availability of a particular size or color. Standards-based querying across the merchants' own Web sites would enable shoppers to compare richer, more up-to-date information provided by the merchants themselves.

The W3C SPARQL Working Group is currently developing a SPARQL Service Description designed to standardize how SPARQL "endpoints," or information sources, present their data, with specific standards for how they describe the types and amount of data they have, says Lee Feigenbaum, vice president of technology at Cambridge Semantics and co-chair of the W3C SPARQL Working Group.

Tools, platforms, prewritten components and services are available to help make semantic deployments less time-consuming, less technically complex and (somewhat) less costly. Here's a brief look at some options.

Jena is an open-source Java framework for building semantic Web applications. It includes APIs for RDF, RDFS and OWL, a SPARQL query engine and a rule-based inference engine. Another platform, Sesame, is an open-source framework for storing, inferencing and querying RDF data.

Most leading semantic Web platforms come with knowledge repositories that describe general terms, concepts and acronyms, giving users a running start in creating ontologies. "Customers have conflicting demands: to have the platform be able to come back with accurate answers out of the box, and to have it tailored to their business area," says Seth Redmore, vice president of product management at Lexalytics.

To address that quandary, Lexalytics sells its semantic platform primarily to service provider partners, who then fine-tune it for specific business domains and applications. Thomson Reuters' Machine Readable News service is one example.

Other platform vendors have been rolling out business-specific solutions. Endeca, for example, provides offerings for e-business and enterprise semantic applications, including specific ones for e-commerce and e-publishing.

There are also tools to automatically incorporate semantic metadata, and W3C standards, into existing bodies of information. For example, a Revelytix tool automatically transforms both structured and unstructured data to RDF, according to Lang. It then presents, or "advertises," the information on the Web as a SPARQL endpoint that can be accessed by SPARQL-compliant browsers, he adds.

An open-source tool is also available that can map selected database content to RDF and OWL ontologies, making the data accessible to SPARQL-compliant applications.
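
To give a feel for what mapping database content to RDF involves, here is a minimal Python sketch that turns rows of a SQLite table into N-Triples-style statements. It is not how any of the tools mentioned here work internally; the `BASE` namespace, the URI layout and the `rows_to_ntriples` function are assumptions made for illustration, and every value is emitted as a plain string literal.

```python
import sqlite3
from typing import List

BASE = "http://example.org/"  # hypothetical namespace, not a real vocabulary


def rows_to_ntriples(conn: sqlite3.Connection, table: str, key_column: str) -> List[str]:
    """Emit one statement per non-key, non-null column value of each row.

    The table name is interpolated into the SQL text, so it must come from
    trusted code, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    lines: List[str] = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key_column]}>"
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            # Escape backslashes first, then quotes and newlines, for a string literal.
            literal = (
                str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            )
            lines.append(f'{subject} <{BASE}{table}#{column}> "{literal}" .')
    return lines
```

For a `product` table holding one row with id 1, name "Flat-screen TV" and price 599.0, the function yields two statements about `<http://example.org/product/1>`, one for the name and one for the price. A real mapping tool would also assign datatypes and link the columns to terms in an agreed ontology.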

Revelytix sells a W3C-compliant knowledge-modeling tool called Knoodl, a wiki-based framework designed to help everyone from technical specialists and subject matter experts to business users collaboratively develop a semantic vocabulary that describes and maps domain-specific information residing on multiple Web sites. Communities of interest can then use Knoodl.com to access, share and refine that knowledge, according to Lang.

For example, consultancy Dachis Group has developed what it calls a Social Business Design architecture whose purpose is to help users collaborate, share ideas and then narrow down and "expose and make sense of" data within a business organization or other community of relevant individuals, such as customers or partners, says Lee Bryant, managing director of the firm's European operations.

Such offerings can significantly ease the task of developing a semantic infrastructure. For instance, Bouygues Construction used Sinequa's semantic platform, Context Engine, and needed only about six months to do an initial implementation of a semantic system for locating in-house expertise, according to Eric Juin, director of e-services and knowledge management at Bouygues.

Bouygues has since developed a semantic search application that helps knowledge workers quickly find information that resides either on internal systems or on the Web, Juin says.

Context Engine indexed and calculated the relevance of people and concepts in a half-million documents, including meeting minutes, product fact sheets, training materials and project documentation, he says. The platform includes a "generic semantic dictionary" of common words and terms, which it can translate between various languages, according to Juin. For example, a French employee could semantically search a document written in German.

Certain business-specific acronyms and terms have to be added manually -- that's an ongoing process that requires semantic experts to collaborate with business users, Juin says. Over time, however, his group has been adding fewer keyword definitions, because the semantic engine can use other, related words to determine a term's relevance to a specific subject, he says.

Companies that lack the internal resources to build their own semantic Web infrastructure can follow Alpha Equity's lead and go with a semantic service provided by a third party.

One such provider is Thomson Reuters, which, in addition to its Machine Readable News service, offers a service called OpenCalais through which it creates semantic metadata for customers' submitted content. Customers can deploy that tagged content for search, news aggregation, blogs, catalogs and other applications, according to Thomas Tague, a vice president at Thomson Reuters.

OpenCalais also includes a free toolkit that customers can use to create their own semantic infrastructures and metadata, and to set up links to other Web providers. The service now processes more than 5 million documents per day, according to Tague.

DNA13, Lithium Technologies (now the owner of Scout Labs) and Cymfony are among the semantic service providers that query, collect and analyze Web-based news and social media, with an eye toward helping customers in areas such as brand and reputation management, customer relationship management and marketing.

In a survey of about 895 semantic technology experts and stakeholders, 47% of the respondents agreed that Tim Berners-Lee's vision of a semantic Web won't be realized or make a significant difference to end users by the year 2020. On the other hand, 41% of those polled predicted that it would. The remainder did not answer that query.

The basic W3C standards are finalized and gaining support, and there's an increasing number of platforms and software tools. Still, semantic Web technology -- and standards -- are far from achieving that critical mass of support needed to fully exploit their benefits, experts agree.

It's important at this point to make a clear distinction between semantic technologies in general and semantic Web technologies that make use of W3C standards and that specifically apply to Web information sources.

Semantic technologies are definitely catching on, particularly for enterprise knowledge management and business intelligence, experts agree. The market for semantic-based text analysis tools that help users "find what they want in unstructured information" is growing at about 20% per year, says Susan Feldman, an analyst at research firm IDC. Moreover, most enterprise search platforms now include semantic technology, she says.

Compared with more traditional BI tools, one of semantic technology's main benefits is that it gives subject matter experts the ability to build their own query structures without IT needing to go through the rigorous and time-consuming tasks of building and then rebuilding data warehouses and data marts. For example, "an expert in, say, compliance and regulations can build a semantic structure in two weeks, not nine months," and then change it quickly and easily, says Mills Davis, managing director at semantic research firm Project10X.

Other benefits of semantic technology -- again compared to traditional BI tools -- include the ability to perform more complex and broader queries and analysis of unstructured data, and the ability to start small with targeted queries and then grow and evolve in small increments.

On the Web, semantic technology has established a firm foothold in a growing number of niche business markets. One is e-publishing, where online news services DBpedia, Geonames, RealTravel and MetaWeb (Freebase) were early adopters. Another is the online financial information services business, where companies such as Thomson Reuters and Dow Jones have jumped on board. Some of the prominent users of Thomson Reuters' OpenCalais offering include news media organizations like CBS Interactive and its CNET unit, Slate, the Huffington Post, and e-news aggregator Moreover Technologies. Furthermore, over 9,000 online publishing sites now use OpenPublish, a package that integrates OpenCalais with Drupal, an open-source content management system.

More recently, online retailers have started deploying semantic Web platforms to help optimize product and brand placements in search engine results, and to provide consumers with richer and more efficient shopping experiences.

Still missing, however, is widespread support of W3C standards and common vocabularies that will facilitate semantic queries across different Web and business domains. Right now, the majority of semantic Web schemas are developed by, and proprietary to, individual companies both on and off the Web, and different groups within a business enterprise. Such frameworks often contain business- and function-specific terms, jargon and acronyms that don't translate well to other knowledge domains. As a result, to do cross-domain querying, semantic applications and services must interface with each information source's ontology individually, according to industry sources.

Take the case of Eni. The global energy company's technical and subject matter experts have spent 12 years developing and fine-tuning a semantic-based BI platform based on Expert System's Cogito, according to Daniele Montanari, Eni's practice leader for semantic technologies. The platform supports oil-, gas- and power-related trading, production and logistics processes, Montanari adds.

Cogito allows Eni's end users to go to a preselected and often presubscribed information source on the Web, locate key information on a particular topic and generate a "corpus" that can then be downloaded, automatically updated and semantically queried, Montanari says.

Semantic schemas tend to be specific to a particular business area, Montanari says. For example, the company's refining division has developed semantic frameworks and classifications to quickly locate information within a vast corpus of articles. Many of those articles were written by Eni's R&D group, while others come from Web sources to which the group subscribes, he notes.

However, generalized Web searches -- say, for the latest technical developments in the oil industry -- are problematic because each site has its own largely proprietary ontology, Montanari says. "To cover multiple sources within an information domain, you have to define a common semantic model," he notes.

The same issues apply to internal semantic queries, Montanari says. His group had once hoped to create an enterprisewide semantic schema that would "model and map correspondences for everything in our databases and data sets, with no ambiguity anymore," but the company was unable to resolve differences among business domains including oil, gas, R&D, marketing and others.

"Even at the linguistics level, there are issues," he notes. As a result, internal queries tend to remain within a particular business group or specialty.

Standardized ontologies are starting to crop up in industries feeling regulatory and/or customer pressure, such as healthcare and pharmaceuticals. Whether e-commerce companies will rally around a common schema remains to be seen.

One such effort is the GoodRelations e-commerce semantic vocabulary. As of now, only a handful of companies, including BestBuy.com and Overstock.com, have signed up for it. Google recently announced that it also supports the vocabulary, according to Hepp Research, which markets and publishes GoodRelations.

"Like telephones and the Internet... the technology becomes more valuable as more people use it," says Phil Simon, a consultant and the author of The Next Wave of Technologies. What's still missing, for many businesses, is a clear payback that will justify the often major cost of deployment, he adds. A company that wants to make a large body of unstructured information accessible, either internally or on the Web, "can spend years and years setting up a semantic Web infrastructure ... before it sees a payoff," says Simon, noting that such efforts can involve huge investments in cleaning up and tagging of data, plus investments in new technologies.

Indeed, the semantic Web, like many other groundbreaking information technologies before it, may be stuck in a classic Catch-22: A critical mass of users is needed before the benefits kick in, but businesses, particularly e-commerce companies, won't jump in until that magic number is reached.

In his blog, BestBuy.com's lead Web development engineer, Jay Myers, says: "Product categories can be unique to a retailer/manufacturer, and with billions of consumer products and endless numbers of product categories, universal product categorization seems to be an unreachable goal. I have seen a few attempts at mass product categorization, but I haven't seen a ton of progress (who would want to manage a massive global product taxonomy?!). Furthermore, getting consensus on category definitions seems like a futile effort that should really be avoided."

More optimistically, he goes on, "just because there aren't any universal standards out there doesn't mean we can't start giving machines a shot at some semblance of product categorization" using available W3C standards and ontologies like GoodRelations. "That's a win-win," he adds, "because the business gets satisfied customers, and the customer makes optimal buying decisions based on relevant product data."

Indeed, many other members of the semantic Web community remain hopeful that semantic technology will revolutionize the Web -- eventually.

"Increasing user and data mobility, and the expansion of Internet services and digital data information into everyday life, are pushing us in the semantic direction," says Project10X's Davis. With the rapid proliferation of Web information sources whose provenance is questionable, he continues, "you're not just looking for a needle in a haystack -- you're looking for the right haystack. Semantics provides a critical means of separating the wheat from the chaff."

"When I mention semantic Web in tech circles, nine out of 10 don't know what the hell I'm talking about," Next Wave author Simon notes. "But do I believe in its power, and that it will be a game changer down the road? Absolutely."

Horwitt, a freelance reporter and former Computerworld senior editor, is based in Waban, Mass.