During the last two weeks I was, among other things, investigating how to implement search across all openSUSE web pages. As part of our Umbrella project, we want all our websites to look unified, and searching across all of them is part of this goal. So what did I try, and what are my conclusions? Let’s see…
Customized Google Search
The first idea would be to use Google search. They offer customized search for anyone’s website. They are good at searching and we wouldn’t need to implement anything ourselves. Another upside of this approach is that we wouldn’t need any infrastructure for it: they would let us use their machines.
But it has some downsides as well. One minor thing is that it would index all our sites regardless of their content, so a wiki page might come up as less relevant than a comment on someone’s blog. I don’t think we really want this.
The major downside, as I see it, is the agreement. I’m not a lawyer, but I didn’t like it. Let’s say that the requirement to display some advertisements and Google’s preferred links is OK. But according to the agreement we can’t customize the results as much as we want; we can only provide a theme, and they may use it somehow (the agreement gives no details). Another thing I found quite disturbing is that they can use our logo and trademarks forever to promote their products. Well, I don’t think we mind right now, but forever is a long time. Maybe I just didn’t understand the agreement well, but I’m quite sure we don’t want to sign it without discussing it with some skilled lawyers.
Last but not least, as an Open Source community we should try to go for an open solution. So I decided to check some of the Open Source engines available on the internet…
Open Source Solutions
I took a brief look at Xyzse and Swish++. The disadvantage I found was that their latest versions seem to have been released sometime in 2008. This doesn’t have to be bad, but I think that something more alive may be better. The other thing I didn’t like was that it seemed I would need to hardcode some search limits when compiling the packages (at least it looked that way from the installation instructions, which required editing some headers manually).
YaCy is a really interesting search engine. Its main innovative idea is that it is decentralized. You just run one peer, connect it to the network and then search across all peers in that network. You don’t need any big server; you can make it work with everybody just indexing their own website. Really interesting idea. One small thing I personally didn’t like was that it is written in Java (I don’t really speak Java). It was quite easy to try out, as it starts its own web server, but it looked like it wouldn’t be easy to customize. It would be great to use it for my own web pages, but I think we want something else for openSUSE.
Datapark Search Engine
The last search engine I want to talk about is Datapark Search Engine. It is an Open Source engine written in C. For data storage it can use MySQL, PostgreSQL or SQLite. It can be used as a CGI program on the web, as an Apache module or through its PHP bindings. The results page is highly customizable: it’s just an HTML template that gets read and filled in with results. So it wouldn’t be any problem to create a Bento theme for it and integrate it with the rest of our websites.
Another interesting feature is that it lets us tag all servers and create a hierarchical category list, which makes it easier to search just one part of our infrastructure. I haven’t tried this feature yet, but I think we can use it. We can also give some extra points to the most relevant sites (I think the wiki deserves this).
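To make the “extra points” idea a bit more concrete, here is a minimal sketch in Python. It is not how Datapark Search Engine implements weighting; the host names, the weights and the `Hit` type are all made up for illustration. It only shows the effect I have in mind: a wiki page overtaking a blog comment that had a slightly better raw score.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical per-host weights; host names and numbers are illustrative only.
SITE_BOOST = {
    "wiki.example.org": 2.0,   # sites we trust get extra points
    "blogs.example.org": 0.8,  # blog comments count a bit less
}


@dataclass
class Hit:
    url: str
    score: float  # raw relevance as reported by some search engine


def boosted_score(hit: Hit) -> float:
    """Multiply the raw score by the weight of the hit's host (default 1.0)."""
    host = urlparse(hit.url).hostname or ""
    return hit.score * SITE_BOOST.get(host, 1.0)


def rank(hits: list[Hit]) -> list[Hit]:
    """Return hits ordered by boosted score, best first."""
    return sorted(hits, key=boosted_score, reverse=True)


hits = [
    Hit("http://blogs.example.org/post/42#comment-7", 1.2),
    Hit("http://wiki.example.org/MySQL", 0.9),
]

for hit in rank(hits):
    print(f"{boosted_score(hit):.2f}  {hit.url}")
```

With these made-up numbers the wiki page ends up first even though its raw score was lower, which is exactly the behaviour we would want from the real engine.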
The last very interesting feature is that it can index pretty much anything. It doesn’t have to be only web pages. Anybody can write their own plugin that knows how to handle some specialized format. If I want to be able to search among the rpms on the Build Service, I can write a simple filter to make it possible. Then a search for MySQL would show not only wiki pages dedicated to MySQL and related blog posts, but also the MySQL rpms themselves. Pretty interesting, isn’t it? I’m not really sure whether we want this, but we can do it with this search engine 😉
I think we should use Datapark Search Engine, because it’s Open Source, it has categories and tags, it can give extra points to the sites we like and it’s highly customizable. If I missed something interesting that we should evaluate, please let me know. There are many interesting projects out there and I tried only a few of them. Although I think I found what I was looking for, any comments and suggestions are welcome…
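Just to illustrate what such a filter could boil down to, here is a rough sketch in Python. I am assuming that a filter’s job is simply to turn a specialized format into plain text the indexer can read; how it is actually hooked into Datapark Search Engine is not shown. The `package_to_text` function and the metadata fields are hypothetical, and reading the metadata out of a real rpm file is left out on purpose.

```python
def package_to_text(meta: dict) -> str:
    """Render package metadata as plain text that an indexer could consume.

    `meta` is a hypothetical dictionary; a real filter would first have to
    read these fields from the rpm itself.
    """
    lines = [
        f"{meta['name']} {meta.get('version', '')}".strip(),
        meta.get("summary", ""),
        meta.get("description", ""),
    ]
    # Skip empty fields so the indexer only sees meaningful text.
    return "\n".join(line for line in lines if line)


# Made-up example data, not taken from a real package.
example = {
    "name": "mysql",
    "version": "5.1",
    "summary": "Server part of MySQL",
    "description": "A multi-user, multi-threaded SQL database server.",
}

print(package_to_text(example))
```

The output is ordinary text, so a package would be searchable by its name, summary and description just like a wiki page.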