The YaCy Search Server Is Sort-Of Being Actively Developed Again After Half A Decade Of Inactivity

From LinuxReviews

YaCy is a peer to peer search engine you can install on your own computer. It has been in development since 2005. It's horrible. It's slow and the search results are bad. Yet it is the best piece of free software out there with the potential to truly take on the Google/Bing search duopoly. This may be a good time to get involved if you have the Java knowledge and the time.

written by 윤채경 (Yoon Chae-kyung)  2021-04-25 - last edited 2021-04-25. © CC BY

Screenshot: YaCy git from April 25th, 2021, showing the "top" search results for "Linux". You have to add /language/en to get search results in English; there is no easy way to set your preferred language. The top search result will, of course, never be in English even if you bother adding /language/en every single time you search for something.

The YaCy peer to peer search engine has been around since 2005. It works, sort-of. You can install it on your own desktop computer, a home server or a NAS and use it to search the web. It will produce search results, but they won't be very good, and they will show up after what seems like forever if you are used to the very impressive search speed that commercial search engines like Bing and Google, and Bing front-ends like DuckDuckGo and Ecosia, provide.

YaCy is written in Java, and the code-base is mostly ancient. What's worse is that it comes with, and relies on, a bundle of ancient and wildly outdated Java libraries. YaCy was pretty actively developed the first few years after it was released in 2005. Development kind of died out around 2010, with only sporadic minor changes now and then being added to the repositories. Development essentially died around 2016. That changed around March this year.

YaCy had a whole 2 commits in February 2021. That increased to 35 in March. Most of those were done by Michael Christen, YaCy's original author. There have only been four commits so far this month, and the month is almost over, so YaCy development hasn't exactly exploded. It's just... not entirely dead anymore.

The biggest and most important change made in March was an upgrade from Solr 6.6 to 8.8.1. Solr is a search platform developed by the Apache Software Foundation with the Apache Lucene search engine library as a base. Solr 6.0 was released in April 2016. The far more modern Solr 8.0 branch was released in March 2019. The two are incompatible, so anyone using YaCy who upgrades will lose all their old search index data. That's probably fine: if a website can't be re-indexed then it is probably gone and not something you would want in your search results anyway.

The current YaCy git version is still sorting search results from other peers by first received, first shown. How relevant those search results are doesn't matter, because if some results are available then they must be important. One alternative would be to wait for search results to come in, sort them and then present them according to their importance. We've had a patch for this on our YaCy page under the headline "The technical details showing exactly why YaCy is a useless piece of horrid software" since September 2019.
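The alternative described above, collecting results first and sorting them before they are shown, can be sketched in a few lines of Java. This is not the patch mentioned above and it does not use YaCy's real classes; PeerResult, its score field and the rank method are hypothetical names made up for illustration, and the sketch simply assumes each result carries some numeric relevance score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RankedResults {

    /** Hypothetical result type for illustration; YaCy's own classes differ. */
    public static final class PeerResult {
        public final String url;
        public final double score;       // assumed relevance score, higher is better
        public final long arrivalOrder;  // position in which the result arrived from peers

        public PeerResult(String url, double score, long arrivalOrder) {
            this.url = url;
            this.score = score;
            this.arrivalOrder = arrivalOrder;
        }
    }

    /**
     * Returns a new list ordered by descending score.
     * Arrival order is only used to break ties between equal scores.
     */
    public static List<PeerResult> rank(List<PeerResult> received) {
        List<PeerResult> sorted = new ArrayList<>(received);
        sorted.sort(Comparator.comparingDouble((PeerResult r) -> r.score).reversed()
                .thenComparingLong(r -> r.arrivalOrder));
        return sorted;
    }

    public static void main(String[] args) {
        // Results listed in the order they happened to arrive.
        List<PeerResult> received = Arrays.asList(
                new PeerResult("https://example.org/barely-related", 0.1, 1),
                new PeerResult("https://example.org/somewhat-related", 0.5, 2),
                new PeerResult("https://example.org/exactly-what-you-wanted", 0.9, 3));
        for (PeerResult r : rank(received)) {
            System.out.println(r.score + " " + r.url);
        }
    }
}
```

A real implementation would also have to decide how long to wait for slow peers before sorting. That trade-off is left out of this sketch.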

Screenshot: the YaCy web crawl monitoring page showing a website crawl in progress.

The current YaCy git version is as impolite as it was a decade ago. Nobody who runs a website wants some stupid crawler trying to fetch pages every ten milliseconds. We strongly recommend that anyone who considers trying it change the values in source/net/yacy/cora/protocol/ClientIdentification.java to something that doesn't make anyone who sees yacybot in their logs ban it, and perhaps the subnet it is coming from, for its grossly unacceptable out-of-the-box behavior towards the websites it crawls. Those values are, of course, not configurable from the web interface.
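For anyone wondering what polite looks like in code, here is a minimal sketch of a per-host minimum delay. It is a generic illustration, not the contents of ClientIdentification.java: the PoliteDelay class, its reserve method and the one second delay are all assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class PoliteDelay {
    private final long minDelayMillis;
    // Earliest time (in ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    public PoliteDelay(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next fetch slot for the given host and returns how many
     * milliseconds the caller should wait before sending the request.
     * The clock is passed in so the behavior is easy to test.
     */
    public synchronized long reserve(String host, long nowMillis) {
        long slot = Math.max(nowMillis, nextAllowed.getOrDefault(host, nowMillis));
        nextAllowed.put(host, slot + minDelayMillis);
        return slot - nowMillis;
    }

    public static void main(String[] args) {
        // Assumed value: at most one request per second to any single host.
        PoliteDelay delay = new PoliteDelay(1000);
        long now = 0; // simulated clock; a real crawler would use System.currentTimeMillis()
        for (int i = 0; i < 3; i++) {
            long wait = delay.reserve("example.org", now);
            System.out.println("request " + i + ": wait " + wait + " ms");
            now += wait; // a real crawler would sleep for 'wait' ms here
        }
    }
}
```

The delay is tracked per host, so crawling many different sites at once stays fast while no single site gets hammered.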

YaCy can slightly annoy people who have websites crawled by it in other ways. Point YaCy at a website's sitemap and it will, of course, start with the pages given the lowest <priority></priority> first and work its way up to those with the highest priority.
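The order a site owner would expect is highest priority first. A rough sketch of that, using only the XML parser that ships with Java, could look like the following. SitemapOrder, Entry and crawlOrder are hypothetical names and not part of YaCy. The sketch assumes a well-formed sitemap where every <url> has a <loc>, and it uses 0.5 for a missing <priority>, which is the default given by the sitemap protocol.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SitemapOrder {

    /** One <url> entry of a sitemap (illustrative type, not a YaCy class). */
    public static final class Entry {
        public final String loc;
        public final double priority;

        public Entry(String loc, double priority) {
            this.loc = loc;
            this.priority = priority;
        }
    }

    /** Parses a sitemap and returns its entries with the highest priority first. */
    public static List<Entry> crawlOrder(String sitemapXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parsable sitemap", e);
        }
        NodeList urls = doc.getElementsByTagName("url");
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < urls.getLength(); i++) {
            Element url = (Element) urls.item(i);
            String loc = url.getElementsByTagName("loc").item(0).getTextContent().trim();
            NodeList prio = url.getElementsByTagName("priority");
            // A missing <priority> falls back to the protocol default of 0.5.
            double p = prio.getLength() > 0
                    ? Double.parseDouble(prio.item(0).getTextContent().trim())
                    : 0.5;
            entries.add(new Entry(loc, p));
        }
        // Descending priority; the sort is stable, so ties keep sitemap order.
        entries.sort(Comparator.comparingDouble((Entry e) -> e.priority).reversed());
        return entries;
    }

    public static void main(String[] args) {
        String xml = "<urlset>"
                + "<url><loc>https://example.org/old-news</loc><priority>0.1</priority></url>"
                + "<url><loc>https://example.org/about</loc></url>"
                + "<url><loc>https://example.org/</loc><priority>1.0</priority></url>"
                + "</urlset>";
        for (Entry e : crawlOrder(xml)) {
            System.out.println(e.priority + " " + e.loc);
        }
    }
}
```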

There is, as you may have gathered by now, a lot of room for improvement. YaCy really is a horrible piece of garbage as it stands, even with the unusually many code commits made during March 2021. So why do we even mention it?

Well, there aren't any alternatives. YaCy is the only somewhat-working peer to peer search engine. A distributed censorship-resistant search engine with tens of thousands of nodes across the world without bias, censorship or advertisements would be a really good thing. The idea is sound, and it is a shame that the only implementation is far from it.

Screenshot: the YaCy network map as of April 25th, 2021.

To put our criticism of YaCy in some context: YaCy has, after 16 years, managed to achieve a peer to peer network consisting of a whopping 280 nodes. That's not exactly what you would call popular. There have likely been a lot of people who have tried it, deleted it and written it off as the trash it is over the years.

This really is a free software project (licensed under the GNU GPLv2) that could go far and wide if it were given some desperately needed attention from a few hobbyists with basic Java programming skills and the time it takes to get involved with a project like this.

How To Try It

First, make sure you have a Java development kit package installed. Those would typically be called something like java-11-openjdk. Search your distribution's repositories for java and grab whatever package has open and jdk in the name. That, git and Apache Ant (which provides the ant command) are all you need to check out the code and compile it with ant clean all.

git clone https://github.com/yacy/yacy_search_server.git
cd yacy_search_server
ant clean all

The ant clean all command will compile it, even though it doesn't sound like it would if you are unfamiliar with old-school Java applications. You will see a metric ton of messages about deprecated functions being used when you compile it. That's one of the many areas where YaCy can be improved. Starting YaCy is done by running a script file named startYACY.sh. It will show a message informing you which port on localhost it is listening on so you can go there and configure it.

You can show up at https://github.com/yacy/yacy_search_server with patches if you have the time and Java skills required to take on a peer to peer search engine project that could potentially topple the Google-Bing search engine duopoly. And yes, there are just the two big ones. We are not trying to belittle www.mojeek.com and yandex.com; they exist, but their worldwide market share is barely measurable. An index makes you a search engine. Bing front-ends that pull search results from Bing and show them in a custom "privacy-focused" skin, like DuckDuckGo, don't count. You may get a bit more privacy, but you won't get to see anything big tech doesn't want you to see. You would need something like a large peer to peer network with a completely different and independent index to get additional diversity. YaCy could fill that void.

Reader comment from Anonymous (bdfa672767), 9 months ago: I agree. Smart review. I have the Java chops to make some contributions, but honestly I'd want to rewrite the whole thing from scratch. I don't have the time for that.