
Faceted Search Research Papers

Can you recall the last time you touched a physical phone book? We once relied on such tools as the only means of finding certain kinds of information. They were clunky and cumbersome, but when looking for a needle in a haystack—one phone number among thousands—they were essentially our only hope. In physical media, a single piece of information could be placed in only one location at a time, so providing access to many such objects required a static organization system describing the exact location of each item.

Direct access to digital information has completely changed the rules: we can now be "better than reality." With search boxes, users can jump directly to whatever they are interested in without consulting any complex system. This instant gratification is a vast improvement over flipping through physical pages. But the spread of digital access to information has been accompanied by an explosion in the volume of information available about any given topic, to the extent that even instant access via search doesn’t necessarily make finding our needle in the haystack any easier.

Filters vs. Facets

Fortunately many websites now provide even more sophisticated tools to help users find information. Filters are one such tool — they analyze a given set of content to exclude items that don’t meet certain criteria. More recently, rich information systems have also begun to provide faceted navigation, which basically extends the idea of filters even further into a complex structure that attempts to describe all the different aspects of an object, for maximum flexibility in information retrieval.

These two terms — filters and faceted navigation — are sometimes used interchangeably. There is in fact quite a lot of overlap between these concepts: they share the same basic mechanism of analyzing a large set of content and excluding any objects that don’t meet certain criteria. The difference between the two is essentially one of degree, but it is an important difference. Ideally faceted navigation provides multiple filters, one for each different aspect of the content. Faceted navigation is thus more flexible and more useful than systems which provide only one or two different types of filters, especially for extremely large content sets. Because faceted navigation describes many different dimensions of the content, it also provides a structure to help users understand the content space, and give them ideas about what is available and how to search for it.

For example, imagine searching for a healthy recipe for green enchiladas. Cooks.com has hundreds of green enchilada recipes. But in the absence of any filters, it’s difficult to find a healthy recipe, unless there happens to be a recipe with the word "healthy" in the recipe title. (In this case, no such luck.)

Basic search filters like the tabs on Food.com can help users narrow down large sets of search results, but only if the filters actually match the dimensions that are most important to users. On Food.com, the tabs filter by the type of content — recipes, photos, cookbooks, etc. — so in this case they wouldn’t be helpful.

In contrast, the full-fledged faceted navigation on Epicurious.com allows users to narrow results by several different dimensions, including Cuisine, Main Ingredient, and Dietary Consideration. In this case it’s a simple matter to view only healthy recipes.
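The difference between a single filter and faceted navigation can be sketched in a few lines of Python. This is only an illustration: the recipe records, field names, and helper functions below are invented for the example and do not describe how any of the sites mentioned above are implemented.

```python
from collections import Counter

# Hypothetical recipe records; titles, fields, and values are made up.
RECIPES = [
    {"title": "Green Enchiladas", "cuisine": "Mexican", "dietary": "Healthy"},
    {"title": "Cheesy Green Enchiladas", "cuisine": "Mexican", "dietary": "Comfort"},
    {"title": "Green Curry", "cuisine": "Thai", "dietary": "Healthy"},
]


def apply_filters(items, **criteria):
    """Keep only the items whose fields equal every given criterion."""
    return [
        item for item in items
        if all(item.get(field) == value for field, value in criteria.items())
    ]


def facet_counts(items, fields):
    """For each facet field, count how many items carry each value."""
    return {
        field: Counter(item[field] for item in items if field in item)
        for field in fields
    }


# A filter narrows the set along one dimension.
healthy = apply_filters(RECIPES, dietary="Healthy")
print([recipe["title"] for recipe in healthy])

# Facets summarize every dimension at once, with a count per value.
for field, counts in facet_counts(RECIPES, ["cuisine", "dietary"]).items():
    print(field, dict(counts))
```

The counts are what give users a picture of the content space: they show which values exist along each dimension before any of them is selected.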

Although the faceted navigation system has obvious benefits for end users, this type of structure is significantly more expensive to create and maintain; more resources must be invested in designing the user interface, and both existing and future content must have metadata applied for each facet.

The extra power of faceted navigation also adds interaction cost by presenting users with more options to comprehend and manipulate. A simple filter can often be easier to understand and faster to use.

For these reasons it’s wise to make sure that users truly do need faceted navigation in order to use your content effectively, before investing in it.


The following facets are included as part of “core” search results.

PLOS Search Field | Description | Note

Specialized indexes for faceting, see notes below
doc_partial_parent_id | DOI | the ID value of the parent document
doc_type | Document Type | Two possible values: full or partial
doc_partial_type | The type article section | introduction, abstract, etc
doc_partial_body | The text of the article section |

Facet fields
affiliate_facet | Affiliate (facet) | Don’t search against
author_facet | Author (facet) | Don’t search against
subject_facet | Subject (facet) | Don’t search against
editor_facet | Editor (facet) | Don’t search against
article_type_facet | Article Type (facet) | Don’t search against

However, two of these facets deserve special attention: cross_published_journal_key and doc_partial_type. To build the result sets for these two fields, two additional SOLR queries are required.

The “cross_published_journal_key” facet provides visibility into how many documents (across all journals) match the terms you entered. It is queried separately because the “core” search is, by default, journal-specific.

For details on building SOLR queries, see Apache’s SOLR website. Here are a few sample queries against our schema to get you started. Results are given in XML, SOLR’s default format.

Simple search for the term “test”. An article is included in this result set if the word “test” appears anywhere in the article.

Search for the term “test” with facets. This query also queries for the term “test”, but the results also include facets for subjects, authors, editors, article types, and affiliates.

Get the journals facet (cross_published_journal_key). This shows all of the Journals which have been indexed in Solr.

Get the “Where my keywords appear” facet (doc_partial_type). This is a list of all the sections of an article (e.g., Body, Materials and Methods, Introduction, etc.) in which keywords can be specifically sought. This is useful if, for instance, you want to know whether a name appears only in the References section.
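The four sample queries described above could be assembled along the following lines. Treat this as a rough sketch under stated assumptions rather than the original sample URLs: the endpoint is the one used by the example URLs later in this document, the facet field names come from the table above, and `build_query` is a hypothetical helper. The exact `q` value for the journals query is a guess (`*:*` is Solr's match-all syntax).

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs later in this document.
BASE_URL = "http://api.plos.org/search"

# Facet fields listed in the table above.
CORE_FACETS = ["subject_facet", "author_facet", "editor_facet",
               "article_type_facet", "affiliate_facet"]


def build_query(q, facet_fields=(), **params):
    """Return a search URL for q (hypothetical helper, not part of any API)."""
    pairs = [("q", q)]
    pairs.extend(params.items())
    if facet_fields:
        pairs.append(("facet", "true"))
        # facet.field may be repeated, once per facet requested.
        pairs.extend(("facet.field", name) for name in facet_fields)
    return BASE_URL + "?" + urlencode(pairs)


# 1. Simple search for the term "test".
simple = build_query("test")

# 2. Same search, with the core facets included in the response.
with_facets = build_query("test", facet_fields=CORE_FACETS)

# 3. Journals facet; the match-all query here is an assumption.
journals = build_query("*:*", facet_fields=["cross_published_journal_key"], rows=0)

# 4. "Where my keywords appear" facet, computed over partial documents.
sections = build_query("doc_partial_body:test",
                       facet_fields=["doc_partial_type"],
                       fq="doc_type:partial", rows=0)

for url in (simple, with_facets, journals, sections):
    print(url)
```

Setting `rows=0` in the last two queries asks for the facet counts only, without returning any matching documents.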

When thinking of documents stored in SOLR, it’s important to think of each document as a collection of fields. Fields have different storage mechanisms optimized for searching and faceting. It’s important at this point to have a good understanding of what a facet is.

“Faceted search is the dynamic clustering of items or search results into categories that let users drill into search results (or even skip searching entirely) by any value in any field. Each facet displayed also shows the number of hits within the search that match that category. Users can then “drill down” by applying specific constraints to the search results. Faceted search is also called faceted browsing, faceted navigation, guided navigation and sometimes parametric search.” — Lucid Imagination

Not all stored fields should be searched against. Fields ending with _facet are stored in a way that generates facets accurately and are not designed to be searched against. Some fields are indexed but not stored, and therefore cannot be part of the search results. We also store two types of documents, as defined by the doc_type field: “full” and “partial”. The latter exists for computing the “Where my keywords appear” facet.

For normal search queries, “doc_type:full” should always be used as a filter (the fq URL query parameter).

Note that we use two types of searches: dismax and standard. Dismax searches are used for simple searches where no fields are specified. In that case the title, author and everything fields are searched, with the highest priority given to title and author. For more details on dismax, see the SOLR configuration file and the SOLR documentation. For 90% of our searching, this is what should be used. Standard searches can be used for more specific results against specific fields; for these searches, one or more fields must be specified to search against.
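As a rough illustration of the two styles, the sketch below builds an unfielded query (the kind the text says is handled by dismax) and a fielded query for a standard search, both restricted to full documents. How the server chooses between the two parsers is not specified here, so this only shows the shape of the `q` parameter; the helper names and the sample search terms are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs later in this document.
BASE_URL = "http://api.plos.org/search"
FULL_DOCS = "doc_type:full"  # filter recommended for normal searches


def _quote(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def simple_search_url(terms):
    """Unfielded query: no field names, just the user's terms."""
    return BASE_URL + "?" + urlencode({"q": terms, "fq": FULL_DOCS})


def standard_search_url(field_terms):
    """Fielded query: every clause names the field it searches."""
    if not field_terms:
        raise ValueError("a standard search needs at least one field")
    clauses = [f"{field}:{_quote(value)}" for field, value in field_terms.items()]
    return BASE_URL + "?" + urlencode({"q": " AND ".join(clauses), "fq": FULL_DOCS})


print(simple_search_url("malaria"))
print(standard_search_url({"title": "malaria", "author": "smith"}))
```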

“Where my Keywords Appear” Facet

The logic gets a little tricky here. Out of the box, SOLR does not provide a way to tell users which areas of a document the search terms appeared in. In fact, this runs somewhat counter to the way SOLR is designed. But we determined that this was a powerful bit of knowledge, worth the effort of putting together a system that makes it possible. To do this, a number of SOLR documents are created when an article is ingested into the system. First, a document of doc_type “full” is created that contains the whole body of the research article. For most searches this is all you’ll want to search against, by using the filter query “fq=doc_type:full”.

http://api.plos.org/search?q=id:10.1371/journal.pcbi.1000048

In addition to this first document, a number of document parts are created:

http://api.plos.org/search?q=doc_partial_parent_id:10.1371/journal.pcbi.1000048&fq=doc_type%3Apartial&fl=id,doc_partial_parent_id

You’ll notice that each of these partial documents contains a number of fields duplicated from the original article’s document. We do this so that most search terms applied to the search for the parent document can also be applied to the document parts.

http://api.plos.org/search?q=id:10.1371/journal.pcbi.1000048/title&fq=doc_type%3Apartial&fl=*

The difference between the full document and a partial one is that the partial has no field representing “everything”; instead it has a “doc_partial_body” field and a “doc_partial_type” field. Note that “doc_partial_type” is not stored and can only be retrieved as a facet.

http://api.plos.org/search?q=id:10.1371/journal.pcbi.1000048/title&fl=&fq=doc_type%3Apartial&facet=true&facet.field=doc_partial_type&rows=0
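The full/partial layout described above can be pictured with a small sketch of what ingestion might produce for one article. This is not the actual ingest code: `build_solr_docs` is a hypothetical helper, the `<DOI>/<section>` id scheme is inferred from the single example URL above, and the title and section texts are placeholders.

```python
def build_solr_docs(doi, shared_fields, sections):
    """Return one "full" document plus one "partial" document per section.

    shared_fields: fields copied onto every document (e.g. title, author).
    sections: mapping of section name -> section text.
    """
    full = dict(
        shared_fields,
        id=doi,
        doc_type="full",
        # Only the full document carries the catch-all field.
        everything=" ".join(sections.values()),
    )
    partials = [
        dict(
            shared_fields,
            id=f"{doi}/{name}",          # assumed id scheme
            doc_type="partial",
            doc_partial_parent_id=doi,
            doc_partial_type=name,
            doc_partial_body=text,
        )
        for name, text in sections.items()
    ]
    return [full] + partials


docs = build_solr_docs(
    "10.1371/journal.pcbi.1000048",
    {"title": "Placeholder title"},
    {"title": "Placeholder title", "abstract": "Placeholder abstract text"},
)
for doc in docs:
    print(doc["doc_type"], doc["id"])
```

Because the shared fields are repeated on every partial document, a query written for the parent article can be run against the parts with only the filter changed to doc_type:partial.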

So if we want to find all partial documents with terms that match our search query:

http://api.plos.org/search?q=doc_partial_body:test&fl=&fq=doc_type%3Apartial

If we want to generate a facet telling us what document parts contain the terms entered:

http://api.plos.org/search?q=doc_partial_body:test&fq=doc_type%3Apartial&rows=0&facet=true&facet.field=doc_partial_type
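Once the facet response comes back, the section names and counts have to be pulled out of the XML. The sketch below parses a hand-written sample in the shape Solr normally uses for facet counts in its XML response format. The counts are invented, the section names are taken from the list earlier in this document, and the function names are hypothetical; a real response would contain more elements than shown here.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, not real API output; the counts are invented.
SAMPLE_XML = """<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="doc_partial_type">
        <int name="Body">3</int>
        <int name="Introduction">1</int>
        <int name="References">0</int>
      </lst>
    </lst>
  </lst>
</response>"""


def facet_values(xml_text, field):
    """Return {facet value: count} for one facet field in a Solr XML response."""
    root = ET.fromstring(xml_text)
    path = f".//lst[@name='facet_fields']/lst[@name='{field}']/int"
    return {node.get("name"): int(node.text) for node in root.findall(path)}


def sections_with_hits(xml_text, field="doc_partial_type"):
    """List the article sections in which the search terms appear at least once."""
    return [name for name, count in facet_values(xml_text, field).items() if count > 0]


print(facet_values(SAMPLE_XML, "doc_partial_type"))
print(sections_with_hits(SAMPLE_XML))
```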
