In Mindbreeze, the Semantic Pipeline refers to the processing that takes place between a crawler picking up a document and that document being indexed. During this process, content and metadata can be extracted and manipulated to make the Enterprise Search Solution more intelligent. The following steps occur in the Semantic Pipeline.
- The Filter Service extracts the content from the document, which defines both what is indexed and what is used for the content preview. Different filters can be selected based on content type. Two examples are creating PDF previews for Microsoft Office documents and extracting and indexing the content inside zip files.
- After the content has been extracted by the Filter Service, the Post-Filter Transformation step can be used to manipulate the content with custom code. Mindbreeze offers a Java SDK to build these plugins.
Precomputed Synthesized Metadata
- Precomputed Synthesized Metadata allows us to create new fields or edit existing ones using expressions written in the Mindbreeze Property Expression Language. The point in the pipeline at which this metadata is created or edited can be configured using the “Transformation Pipeline Slot”.
- To see an example of Precomputed Synthesized Metadata, please see our previous blog post regarding the topic.
- The Mindbreeze property expression language is helpful when defining synthesized metadata. Please see the Mindbreeze Property Expression Documentation for the full details.
- Entity recognition is used to derive metadata from content or other metadata based on rules and patterns. It is applied at the index level.
- CSV Transformation is another way to manipulate metadata, in this case by mapping values according to a CSV file.
- The final way to create metadata is with Item Transformation Plugins. Item Transformations are similar to Post-Filter Transformations, but can be applied per index and after all other manipulation has been performed. Mindbreeze offers a Java SDK to build these plugins.
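To make two of the steps above more concrete, here is a minimal conceptual sketch in plain Java of the ideas behind entity recognition (deriving metadata from content with a pattern) and CSV Transformation (mapping metadata values through a lookup table). It does not use the Mindbreeze SDK or its configuration format. The class name `MetadataSketch`, its methods, the `TCK-` ticket pattern, and the sample CSV rows are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch only: none of these names come from the Mindbreeze SDK.
final class MetadataSketch {

    // Hypothetical rule: ticket IDs look like "TCK-" followed by digits.
    private static final Pattern TICKET_ID = Pattern.compile("\\bTCK-\\d+\\b");

    // Entity recognition idea: derive metadata values from content via a pattern.
    static List<String> extractTicketIds(String content) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = TICKET_ID.matcher(content);
        while (matcher.find()) {
            ids.add(matcher.group());
        }
        return ids;
    }

    // CSV transformation idea: parse "source,target" rows into a lookup table.
    static Map<String, String> parseMapping(String csv) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String line : csv.split("\\R")) {
            String[] columns = line.split(",", 2);
            if (columns.length == 2) {
                mapping.put(columns[0].trim(), columns[1].trim());
            }
        }
        return mapping;
    }

    // Replace a metadata value when the table has an entry; otherwise keep it.
    static String mapValue(Map<String, String> mapping, String value) {
        return mapping.getOrDefault(value, value);
    }

    public static void main(String[] args) {
        String content = "Escalated in TCK-1042, related to TCK-7.";
        System.out.println(extractTicketIds(content));

        Map<String, String> departments =
                parseMapping("HR,Human Resources\nIT,Information Technology");
        System.out.println(mapValue(departments, "HR"));
        System.out.println(mapValue(departments, "Sales"));
    }
}
```

In an actual Mindbreeze deployment these steps are set up through the product's configuration and plugins rather than hand-written code like this. The sketch is only meant to show the kind of transformation each step performs on a document's metadata.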