In Mindbreeze, the Semantic Pipeline is the sequence of processing steps a document passes through between being picked up by a crawler and being indexed. During this process, content and metadata can be extracted and manipulated to add intelligence to the enterprise search solution. The steps of the Semantic Pipeline are described below.

The Semantic Pipeline starts at the crawler, continues through the Filter Service, and then passes through Post-Filter Transformation. The document then moves to the index, where Entity Recognition, CSV Transformation, and Item Transformation are applied when applicable.
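The ordering described above can be pictured as a chain of stages applied to each document in turn. The following Python sketch is purely illustrative: the stage names mirror the steps in this article, but the `Document` class, `make_stage`, and `run_pipeline` are hypothetical and are not part of any Mindbreeze API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a crawled document (not a Mindbreeze type)."""
    raw: bytes
    content: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)  # names of the stages that ran


Stage = Callable[[Document], Document]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its own name."""
    def stage(doc: Document) -> Document:
        doc.trace.append(name)
        return doc
    return stage


# Order taken from the description above; the stage bodies are placeholders.
PIPELINE: List[Stage] = [
    make_stage("filter_service"),
    make_stage("post_filter_transformation"),
    make_stage("entity_recognition"),
    make_stage("csv_transformation"),
    make_stage("item_transformation"),
]


def run_pipeline(doc: Document) -> Document:
    """Pass the document through every stage in order."""
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


result = run_pipeline(Document(raw=b"example"))
print(result.trace)
```

In a real deployment each stage would do actual work; here the trace only makes the order of execution visible.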

 

Filter Service

  • The Filter Service extracts the content of a document, which determines what is indexed and what is used for the content preview. Different filters can be selected based on content type. Two examples are creating PDF previews for Microsoft Office documents and extracting and indexing the content of zip files.
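To illustrate the idea of choosing a filter by content type, here is a minimal Python sketch using only the standard library. It is not how the Mindbreeze Filter Service is implemented: the `FILTERS` table, the function names, and the simplification that every archive member is plain text are all assumptions made for the example.

```python
import io
import zipfile
from typing import Callable, Dict, List


def filter_plain_text(data: bytes) -> List[str]:
    """Decode a plain-text document into a single indexable text block."""
    return [data.decode("utf-8", errors="replace")]


def filter_zip(data: bytes) -> List[str]:
    """Open a zip archive in memory and extract the text of each member."""
    texts: List[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # Assumption: every member is treated as plain text.
            texts.extend(filter_plain_text(archive.read(name)))
    return texts


# Hypothetical mapping from content type to filter function.
FILTERS: Dict[str, Callable[[bytes], List[str]]] = {
    "text/plain": filter_plain_text,
    "application/zip": filter_zip,
}


def extract(content_type: str, data: bytes) -> List[str]:
    """Select a filter by content type; unknown types yield no content."""
    handler = FILTERS.get(content_type)
    return handler(data) if handler else []


# Build a small archive in memory to exercise the zip filter.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "first file")
    archive.writestr("b.txt", "second file")

print(extract("application/zip", buffer.getvalue()))
```

A table keyed by content type keeps the dispatch logic separate from the individual filters, so supporting another format only means adding one entry.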

Post-Filter Transformation

  • After the content has been extracted by the Filter Service, the Post-Filter Transformation step can be used to manipulate the content with custom code. Mindbreeze offers a Java SDK to build these plugins.

Precomputed Synthesized Metadata

Entity Recognition

  • Entity recognition is used to derive metadata from content or other metadata based on rules and patterns. It is applied at the index level.
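As a rough illustration of pattern-based metadata derivation, the Python sketch below applies a set of regular expressions to a piece of text and collects the matches under metadata names. The rule names, the patterns, and the `recognize_entities` function are hypothetical; Mindbreeze's own rule syntax and configuration are not shown here.

```python
import re
from typing import Dict, List

# Hypothetical rules: metadata name -> pattern that recognizes its values.
RULES: Dict[str, re.Pattern] = {
    "invoice_number": re.compile(r"\bINV-\d{4,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def recognize_entities(text: str) -> Dict[str, List[str]]:
    """Return every rule that matched, with the list of matched strings."""
    found: Dict[str, List[str]] = {}
    for name, pattern in RULES.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


sample = "Please pay INV-20231 and contact billing@example.com with questions."
print(recognize_entities(sample))
```

The derived values could then be stored alongside the document as additional metadata, for example to power filters or facets.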

CSV Transformation

  • CSV Transformation is another way to manipulate metadata, in this case by mapping values according to a CSV file.
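The value mapping can be sketched in a few lines of Python. The two-column layout (`source`, `target`), the sample rows, and the choice to replace the value in place are assumptions for illustration only; the actual file format expected by Mindbreeze may differ.

```python
import csv
import io
from typing import Dict

# Hypothetical mapping file: "source" is the original value, "target" its replacement.
MAPPING_CSV = """source,target
DE,Germany
AT,Austria
US,United States
"""


def load_mapping(csv_text: str) -> Dict[str, str]:
    """Read the CSV text into a source -> target lookup table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["source"]: row["target"] for row in reader}


def transform(metadata: Dict[str, str], key: str, mapping: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the metadata with metadata[key] mapped, if a mapping exists."""
    result = dict(metadata)
    if key in result:
        # Values without an entry in the mapping are left unchanged.
        result[key] = mapping.get(result[key], result[key])
    return result


mapping = load_mapping(MAPPING_CSV)
print(transform({"country": "AT", "title": "Report"}, "country", mapping))
```

Keeping the mapping in a CSV file means the lookup table can be maintained without changing any code.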

Item Transformation

  • The final way to create metadata is with Item Transformation plugins. Item Transformations are similar to Post-Filter Transformations, but they can be applied per index and run after all other manipulation has been performed. Mindbreeze offers a Java SDK to build these plugins.
