OpenPipeline – an open-source document processing pipeline

Most commercial search engines include a more or less advanced document processing pipeline for transforming raw input into something that can be indexed. The process involves normalization, entity extraction, linguistic processing, annotation, data cleansing and so on. Open source search engines, meanwhile, are getting pretty good at the core of indexing and search.
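To make the idea of a processing pipeline concrete, here is a minimal sketch in Python. It is not OpenPipeline's API: the names `normalize`, `extract_emails` and `run_pipeline` are hypothetical, and the "entity extraction" stage is a toy regular expression standing in for a real extractor. It only illustrates the general shape, where each stage takes a document and hands an enriched document to the next stage.

```python
import re
import unicodedata
from typing import Callable, Dict, List

# A document is modelled as a plain dict of fields (an assumption for this sketch).
Document = Dict[str, object]
Stage = Callable[[Document], Document]


def normalize(doc: Document) -> Document:
    """Normalization stage: unify Unicode forms and collapse whitespace."""
    text = unicodedata.normalize("NFKC", str(doc["text"]))
    doc["text"] = re.sub(r"\s+", " ", text).strip()
    return doc


def extract_emails(doc: Document) -> Document:
    """Toy entity extraction stage: annotate the document with e-mail addresses."""
    doc["emails"] = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", str(doc["text"]))
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Run the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    raw = {"text": "  Contact   bob@example.com\n now "}
    print(run_pipeline(raw, [normalize, extract_emails]))
```

Modelling stages as plain functions keeps the order explicit and makes it easy to add or drop a step before the document reaches the indexer.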

The state of open source search

Open Source Software (OSS) and free software have been an alternative to commercial, licensed software for decades. The best known and most successful are perhaps projects like GNU/Linux (licensed under the GNU General Public License, GPL), OpenOffice.org, the Apache web server and MySQL. They have all managed to produce excellent, high-quality, stable software with impressively widespread use. Other well-known Open Source projects include the Java programming language, Norwegian TrollTech’s (now Nokia) Qt, Mozilla Firefox, Thunderbird and eZ Publish, and the list goes on.

In search, there are a few players picking up speed that you should be aware of: