Isara Search
The Little Search Engine That Could
25 Million Pages and counting
Well, we didn’t reach our target of 50 million by the new year. But we did manage to get 25 million. Which is five times more than we had a few months ago.
Even though we didn’t reach our goal we did learn how to merge large indexes so that we will be able to (one day) have a 50+ million page index.
Currently we have our crawler (Gizmick) fetching data from hundreds of forums, blogs, and news websites. It’s A LOT of data and it requires a lot of filters (to clear all the junk) but we should be able to add another 10 million pages to our index within another week or two.
We’ll keep crawling until we can walk.
-
Isara Web Crawler’s Name Announced
This weekend we came to realize that a new name is needed for our web crawler. It was previously just named “Isara” but, after doing a test search for “Isara”, the results showed that a lot of sites had our crawler’s name, description, and e-mail address within their content. That is because some sites post HTTP header information on their pages (Ex: “You are using Firefox browser…”). So when our crawler fetches those pages it also fetches our own bot’s HTTP header information being reflected back at us. Since we don’t want our database cluttered with A LOT of “Isara” keywords, for sites that have nothing to do with Isara, we must change our crawler’s agent.name so that it will not disrupt the results. Not that anyone will be searching for “Isara” within the Isara search page, but it’s still nice to know the results are as accurate as possible.
We have now officially changed the name of our web crawler to GIZMICK!
Gizmick is the name of a card game that one of our volunteers (Lara) taught us. The name kind of stuck with us so we thought it would be a fun name to call our web bot.
Currently Gizmick is crawling one hundred popular forums and dozens of major news sites.
-
Crawling Wikipedia’s Index
After numerous attempts to crawl Wikipedia’s HUGE database of articles online, it was suggested to us (Thanks, Ken!) to download Wikipedia’s database dump and crawl their articles locally, on our own server.
To do that we had to install MediaWiki (Wikipedia’s Open Source software) on one of our Ubuntu boxes, download Wikipedia’s latest XML database file (5gb compressed, 24gb expanded), and then convert/import that file into our local MediaWiki database.
So far the process hasn’t been easy. After following the tutorials we were able to import the 24gb XML file into MediaWiki’s SQL database and could see some pages, but they all showed the wiki and html code too. It was formatted incorrectly. We tried a few other ways of importing the data but kept getting errors. After a fresh install of everything again we’re trying, one more time, to import the XML file into the SQL database.
Having Wikipedia’s index locally will make crawling a lot easier. But getting it to work right is not easy.
-
Added a Gigabit Switch
This weekend we added a D-Link Gigabit switch to our setup. Previously we were just using a 10/100 hub and we were only getting data transfers of 8-10mb/s between servers. With the new switch we’re getting an average of 30mb/s. During one transfer, between our two main servers, we were able to get 51mb/s sustained speeds. Looks like our network is no longer the bottleneck.
We’ve asked someone with Nutch clustering experience to help us set up our cluster. It’s possible we’ll have a small cluster set up by the end of the year. Once that happens we’ll be better able to handle the large amounts of data we’re processing.
Until then we’ve got two servers crawling forums, blogs, and news sites. We’re still on target to reach 50m pages by the end of the year.
-
OutOfMemory Error
The new server has been built and it’s running about 40% faster than our previous box. It also runs quieter and much cooler (we added a larger CPU heatsink and fan). Thank you to everyone who helped us with the new equipment.
Everything is running great except that we’re receiving an error when indexing our full database of 23m pages. The error is:
Exception in thread "main" java.lang.OutOfMemoryError: PermGen space
While indexing we noticed that the memory usage kept increasing steadily over 4-6 hours, eventually leading to the OutOfMemory error. This is usually a sign that the Java garbage collection isn’t working correctly and completed tasks are not being cleaned up. As a test, we’ve installed JRockit as our JVM using a great tutorial found here. It’s been indexing for about one hour and the memory usage has barely increased at all. It also seems to be only using one core of our Quad Core, whereas the other Java 6 JVM was using all 4 cores. We’ll know in a few more hours if JRockit fixes our error. We’ve been troubleshooting this problem for 3-4 days. It would be nice to finally have a minor success.
One great thing about having a new server is that we can have the old server crawling while we fix this other problem. So we’re not losing any crawl time.
UPDATE: It’s been 24 hours and it’s still indexing. Memory usage is at 16%, up from 9% at the start, and only one core is being used at a time (typically 100%).
UPDATE2: After two days of processing our electricity went out. So we were unable to complete the indexing. Actually, we’re not sure if it was ever going to complete. Very little had been written to the tmp directory so it probably got lost in a loop. Oh well. Back to the drawing board.
-
Building a New Server
For the past few weeks we’ve been researching hardware to get for a new server. Since our current server doesn’t have onboard RAID we made that a requirement for our new mainboard. We also wanted to get a Quad Core processor so that our data parsing will be faster. Our budget is VERY limited so we have to rely on computer shops to give our charity major discounts. Shops here in Nong Khai know us well and give us some great prices, but they don’t always have the hardware we require. Whereas Bangkok has almost everything we need but we can’t always get discounted prices. For our new server we had to go the Bangkok route because the local shops didn’t have any server hardware.
The case we decided to get is a Lancool PC-K62 and it’s made by Lian Li, who is known for their high quality cases with great ventilation. Price: $90 +$2.50 (overnight delivery)

The mainboard is an Asus P5Q Deluxe. It has onboard RAID 0, 1, 5 and 10. It also has Dual gigabit LAN ports and 8 SATA ports, so we can load it up with harddisks. Price: $175.

For the CPU we got a Quad Core Q9550, 2.83GHz, with a whopping 12mb cache. We’ll most likely overclock it (just a little) so it’s doing 3GHz. Price: $290
The RAM and RAID harddisks will come from our old server but we’ll also be adding a 1tb OS drive so our total storage will be 3tb. We’ll start putting it together tomorrow and probably have it crunching data by the weekend.
-
Go Fetch: Part II
Having successfully completed 10m URLs we thought we’d try for 15m. To speed things up we tried boosting the thread count but anything over 50 threads resulted in a lot of UnknownHostExceptions and socket timeout errors. So, for now, it seems that 50 threads is the fastest we’ll be able to fetch.
Our last fetch took 10 days for 10m pages. I guess that means 15m will take up to 15 days to complete. We’ll let you know.
UPDATE: We’re on Day 9 of the 15m page fetch and so far so good. Our logs are averaging almost 200mb per day. Which means nearly 1.4m pages per day. If that continues then we should reach 15 million in another 2-3 days.
Day 11 and Nutch is still fetching. After it’s completed we’ll tell Nutch to parse all the data collected (over 375gb worth) and then update the database. It took more than 24 hours to parse 10m pages so it’s probably going to take almost 36 hours to parse these 15m pages. Afterwards it might be time to update our server hardware.
UPDATE 2: Fetcher is done! All the data has been parsed and the database has been updated (bin/nutch updatedb) successfully.
status 1 (db_unfetched): 127167675
status 2 (db_fetched): 23615077
So we now have 23m pages in our index and 127m links yet to be crawled. We have another 450,000 pages we can add but we’re going to wait until we have our new hardware. We let our server rest for the night. It’s been going 24/7 for 16 days so we thought it deserved a break.
-
Twice the Nutch. Twice the Fun!
Much like MS Word, Firefox, and even Solitaire, it is possible to open several instances of Nutch at the same time and have them running on one computer to save time and resources.
That’s what we decided to do for the IMDb and Wikipedia portion of our crawl. Instead of crawling these two massive sites separately, we simply copied the original Nutch folder and renamed it Nutch2 (we’re clever that way). After that we added a little piece of code at the end of the nutch-default.xml file to tell Hadoop where the second temp directory should be, so that the two crawls don’t use the same temp folder.
<property>
<name>hadoop.tmp.dir</name>
<value>c:/tmp2</value>
<description>Base for Nutch Temporary Directories</description>
</property>
This might not be necessary but it can’t hurt.
Then we assigned one Nutch fetcher to crawl IMDb and the other to Wikipedia. Each was set to only have 4-5 fetchers per host. That way we were not hitting the two sites with millions of requests.
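For anyone curious what a per-host cap looks like in practice, here is a rough Python sketch of the idea. It is not Nutch's code (Nutch handles this through its own settings). The names PerHostLimiter and fetch_all, and the fetch callback you pass in, are made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


class PerHostLimiter:
    """Caps how many fetches may run against one host at the same time."""

    def __init__(self, max_per_host):
        self._max_per_host = max_per_host
        self._lock = threading.Lock()   # guards the dictionary below
        self._semaphores = {}           # host name -> semaphore

    def _semaphore_for(self, host):
        # Create the semaphore for a host the first time we see it.
        with self._lock:
            if host not in self._semaphores:
                self._semaphores[host] = threading.BoundedSemaphore(self._max_per_host)
            return self._semaphores[host]

    def run(self, url, fetch):
        """Call fetch(url), waiting first if the host is already at its cap."""
        host = urlparse(url).netloc
        with self._semaphore_for(host):
            return fetch(url)


def fetch_all(urls, fetch, max_per_host=5, workers=50):
    """Fetch every URL on a thread pool while respecting the per-host cap."""
    limiter = PerHostLimiter(max_per_host)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(lambda url: limiter.run(url, fetch), urls))
```

With a cap of 5, even a pool of 50 threads never has more than 5 requests in flight against the same site, which is the point of the setting described above.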
-
Go Fetch!
With the successful completion of a 2.2 million URL fetch, we decided to go ahead and try for 10 million. To accomplish this, in a reasonable amount of time, we’ll need to make some adjustments to Nutch and to our Router/DNS settings.
With our last fetch we set Nutch to 15 threads and it took 6 days to complete 2.2 million URLs. To crawl 5 times as much it would take 5 times as long (30 days). That’s a long time to have our server running 24/7, without any interruptions of power or Internet, in this part of the world.
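The back-of-the-envelope math above can be written as a tiny Python helper. It assumes the fetch rate stays the same as in the previous crawl, which is a simplification, and the function name estimated_days is ours.

```python
def estimated_days(target_urls, past_urls, past_days):
    """Estimate crawl time by assuming the same URLs-per-day rate as a past crawl."""
    rate_per_day = past_urls / past_days
    return target_urls / rate_per_day


# Last crawl: 2.2 million URLs in 6 days. Five times as many URLs:
print(estimated_days(5 * 2_200_000, 2_200_000, 6))  # about 30 days
```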
To speed things up we increased the thread count to 50 but, within a minute or so, we got a lot of SocketTimeoutException errors. We fixed it by tweaking the router and server to optimize the bandwidth and doubling the http timeout setting in Nutch to 20,000 milliseconds. We also made the server handle all DNS records so that our ISP isn’t being hit with millions of DNS requests.
A few moments ago we started a 10.5 million URL crawl. The extra 500,000 URLs give us some breathing room in case of errors, so we can be sure to cross the 10 million URL mark.
NOTE: While Nutch is fetching it’s sometimes nice to know how many URLs have been fetched. Nutch doesn’t provide those numbers but it does provide a real-time log file which we can use to estimate the size of our crawl. By using the Nutch log files from our previous 6 day crawl (total of 312mb for 2.17 million urls) we can “estimate” that 1mb equals about 7,000 URLs (including errors).
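The log-size rule of thumb in the note above is easy to script. This is a minimal Python sketch that uses only the two calibration numbers from our earlier crawl (312mb of log for 2.17 million URLs). The function names are ours, and the log path you pass to estimate_urls_from_log is whatever your own setup uses.

```python
import os

# Calibration from the previous 6 day crawl described in the post.
CALIBRATION_LOG_MB = 312
CALIBRATION_URLS = 2_170_000


def urls_per_mb():
    """URLs represented by one megabyte of log, errors included."""
    return CALIBRATION_URLS / CALIBRATION_LOG_MB


def estimate_urls(log_mb):
    """Rough number of URLs fetched for a log of the given size in megabytes."""
    return round(log_mb * urls_per_mb())


def estimate_urls_from_log(path):
    """Same estimate, reading the current size of a log file on disk."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return estimate_urls(size_mb)
```

The ratio works out to a little under 7,000 URLs per megabyte, so it is only an estimate: a day with lots of errors inflates the log without adding fetched pages.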
Stay tuned for daily updates….
DAY 1: The speed of the fetch is really quick! We ended the first day with a 178mb log file and, possibly, 1.25 million URLs.
DAY 2: Another good day of fetching. This time the log file was 172mb. So far so good.
DAY 3: Spoke too soon. Fetching has slowed down. Not sure what is causing the congestion. We ended the day with a 145mb log file. If we had a lot of errors then the log file would have been bigger than usual. So it’s still fetching well, it’s just not fetching as fast.
DAY 4: Things have slowed significantly since Day 1. Not only can we see the speed of the fetches has slowed (visually on the screen) but the log file is now down to 130mb. If there was a problem with the connection/bandwidth then we’d see a lot of SocketTimeoutExceptions or UnknownHosts errors. But we don’t. Not sure what’s causing the problem.
DAY 5: Noticed A LOT of errors rolling up the terminal screen this morning. The log file showed Nutch was giving UnknownHost errors for about 2 hours. After a quick reboot of our router and a right-click “Repair” on our LAN connection in Windows, the fetcher started resolving requests and fetching successfully again. It even went back to the speeds of DAY 1. So we lost about two hours of fetching URLs, or about 70k pages. Today’s log file will be huge but only because of all the errors.
DAY 6: We had a great day of fetching. A 194mb log file or about 1.3 million URLs.
DAY 7: Things slowed down again. Around mid-day I did another reboot of the router and repaired the LAN connection. After about 2 minutes the fetch was really quick again. We ended the day with a 188mb log file.
DAY 8: The fetching is still going strong. Today we had our best day yet. 205mb file with no unusual amount of errors.
DAY 9: We ended today with a 178mb log file. Shouldn’t be much longer.
DAY 10: The fetcher stopped at about 5am and has been processing ever since. We told Nutch not to parse the data but it appears to be parsing. Our last parse (for 2.2 million URLs) took about 10 hours. With 5 times as much data it’s going to take 5 times as long. Now that the bandwidth is available again, we have our other Nutch box crawling the entire IMDb and Wikipedia sites.
DAY 11: It’s done!
status 1 (db_unfetched): 65335873
status 2 (db_fetched): 11094454
Hmmm 65m URLs unfetched. Not sure we can try for all of those, but we might go for 25m. Need to check the quality of the data first. So today we’re going to invertlinks and create the index for the 11m pages we currently have. Then we’ll be able to search the data and see what kind of results we get.
-
New room. New OS. New attitude.
Our main Nutch server has been relocated to its own room, with a dedicated Internet connection, a new CPU, and a new OS. We want to be able to use the server for other applications so we decided that, for now, we’ll use Windows Server 2003 to build up the database/index. Later on we’ll move back to a Linux based system.
Current server hardware:
- 2Ghz Duo Core CPU (upgraded from 1.8Ghz)
- 8gb DDRII RAM
- 500Gb 7200 rpm system drive
- 2tb (500gb x 4) RAID 0 (will make RAID 1 later)
Software:
- Windows Server 2003 Enterprise R2
- Apache 2.2
- Apache Tomcat 6
- Nutch 1.0
Right now the Nutch server is fetching data from 2.2 million URLs using 15 threads. It’s been fetching for 12 hours and should take about 5 days to complete. With large crawls we had a lot of Java errors on the Linux system. We’re hoping the errors will magically disappear with the new OS. We’ll know in a few more days.
UPDATE: It’s been a little more than four days and it’s still fetching, uninterrupted, with no java errors and no DNS outages. If our calculations are correct (18,000 URLs per hour, 432,000 per day) then we should be reaching the end of the fetch sometime tomorrow. Nutch will then begin to parse all the data it gathered. That’s when the errors are most likely to happen so keep your fingers crossed.
UPDATE 2: For some reason the fetching speed dropped by 50% on Day 5 so it took an extra day for the fetch to complete. Once fetching ended Nutch began parsing the 57gb of data gathered, which took about 12 hours. Updating the database took another hour. When it completed, a readdb -stats showed how many pages were fetched:
status 1 (db_unfetched): 33298710
status 2 (db_fetched): 2179320
So we have almost 2.2 million pages fetched and 33 million new links. We’re going to increase the number of threads to 100 and test this crawl again. If our server and internet bandwidth can handle it then we’ll try to crawl those 33 million urls.
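If you want to track these numbers between crawls, the status lines printed by readdb -stats are simple to pull apart. Here is a small Python sketch, assuming the lines look like the two shown above. The function name parse_status_counts is ours, and the sample text is copied from this post.

```python
import re

# Matches lines such as "status 2 (db_fetched): 2179320".
STATUS_LINE = re.compile(r"status\s+(\d+)\s+\((\w+)\):\s+(\d+)")


def parse_status_counts(stats_text):
    """Return a dict like {'db_fetched': 2179320} from readdb -stats output."""
    return {name: int(count) for _code, name, count in STATUS_LINE.findall(stats_text)}


sample = """status 1 (db_unfetched): 33298710
status 2 (db_fetched): 2179320"""

counts = parse_status_counts(sample)
total = counts["db_fetched"] + counts["db_unfetched"]
print("fetched so far: %.1f%% of known URLs" % (100 * counts["db_fetched"] / total))
```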


