
Blazing Fast Wikipedia Search on a NUC with Strus


Modern machines have become damn powerful. We can show you how to run a fulltext search engine with sophisticated retrieval schemes on the complete English Wikipedia collection (without citations, but with the contents of tables) on an Intel NUC (NUC6i3SYK) using Strus. "Running" means that reacting to problems in the project stays within your reach, because the search index can be rebuilt within one working day:

  • using one machine: less than 8 hours in total, i.e. less than 1 hour for unpacking and conversion to XML and 6:45 for building the document search index
  • using three machines: 3 1/2 hours in total, i.e. less than 1 hour for unpacking and conversion, 2:15 for building the document search index and some minutes for copying the data

It also means serving queries with the response times expected from a search engine nowadays. Reliable numbers about how far the engine can be stressed and how many simultaneous queries it can handle are not available yet.

But may I first introduce the machine?

It is an Intel NUC (NUC6i3SYK with 16 GB RAM and a 256 GB SSD):

[Image: the Intel NUC]

Installing the software

  1. Install Ubuntu 14.04 as the OS on the NUC. You may choose another distribution; see the Strus package list for that.
  2. Install the developer software needed for the Wikipedia project:
    apt-get install git-core
    apt-get install cmake
    apt-get install libboost-dev
    apt-get install libboost-thread-dev
    apt-get install libboost-python1.54.0
    apt-get install gcc
    apt-get install g++
    apt-get install python
    apt-get install python-tornado
    apt-get install python-pip
    pip install tornado.tcpclient

    wget https://pypi.python.org/packages/source/t/tornado/tornado-4.3.tar.gz#md5=d13a99dc0b60ba69f5f8ec1235e5b232
    tar xvzf tornado-4.3.tar.gz
    cd tornado-4.3
    python setup.py build
    python setup.py install
  3. Install the Strus packages for development on it:
    apt-get update
    wget http://download.opensuse.org/repositories/home:PatrickFrey/xUbuntu_14.04/Release.key
    sudo apt-key add - < Release.key
    apt-get update
    apt-get upgrade
    apt-get install strus strus-dev
    apt-get install strusanalyzer strusanalyzer-dev
    apt-get install strusmodule strusmodule-dev
    apt-get install strusrpc strusrpc-dev
    apt-get install strusutilities strusutilities-dev
    apt-get install strusbindings-python
  4. Clone the Wikipedia GitHub project and build it (a quick sanity check of the installed tools is sketched after this list):
    git clone git@github.com:patrickfrey/strusWikipediaSearch.git
    cd strusWikipediaSearch
    cmake -DLIB_INSTALL_DIR=lib/x86_64-linux-gnu/ -DCMAKE_INSTALL_PREFIX=/usr/ .
    make
    make install
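
Before continuing, it does not hurt to verify that the command line tools used in the following steps are actually on the path. This is just a small sanity-check sketch; the list of tools simply mirrors the ones invoked below:

    # Quick sanity check: report any of the Strus tools used below that are missing.
    for tool in strusWikimediaToXml strusCreate strusInsert strusInspect strusUpdateStorage strusAnalyzePhrase; do
        command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
    done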

Build the search index

  1. Download the Wikipedia dump, split it and convert it to XML chunks. We compress the XML chunks so that we do not waste disk space we will eventually need when building indexes for languages other than English.
    mkdir -p data
    wget -q -O - http://dumps.wikimedia.your.org/enwiki/20160204/enwiki-20160204-pages-articles.xml.bz2 \
        | bzip2 -d -c \
        | strusWikimediaToXml -f "data/wikipedia%04u.xml,20M" -n0 -s -

    cd data
    for dd in 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24; do
        fnam="wikipedia$dd"'99.xml'
        while [ ! -f "$fnam" ]; do
            echo waiting for $fnam
            sleep 3
        done
        sleep 5
        tar --remove-files -cvzf ./wikipedia$dd.tar.gz ./wikipedia$dd*.xml
    done
    sleep 600
    for dd in 25; do
        tar --remove-files -cvzf ./wikipedia$dd.tar.gz ./wikipedia$dd*.xml
    done
    cd ..

    The result of this step is the following set of archives:

    wikipedia00.tar.gz wikipedia01.tar.gz wikipedia02.tar.gz wikipedia03.tar.gz wikipedia04.tar.gz wikipedia05.tar.gz wikipedia06.tar.gz wikipedia07.tar.gz wikipedia08.tar.gz wikipedia09.tar.gz wikipedia10.tar.gz wikipedia11.tar.gz wikipedia12.tar.gz wikipedia13.tar.gz wikipedia14.tar.gz wikipedia15.tar.gz wikipedia16.tar.gz wikipedia17.tar.gz wikipedia18.tar.gz wikipedia19.tar.gz wikipedia20.tar.gz wikipedia21.tar.gz wikipedia22.tar.gz wikipedia23.tar.gz wikipedia24.tar.gz wikipedia25.tar.gz wikipedia26.tar.gz
  2. We build the storages for 3 storage nodes and insert the downloaded and converted XML chunks. I measured a runtime of 5 hours and 12 minutes for this on my NUC.
    #!/bin/sh

    mkdir -p tmp
    tar -C tmp/ -xvzf $1
    time strusInsert -L error_insert.log \
        -s "path=storage;max_open_files=256;write_buffer_size=512K;block_size=4K" \
        -R resources -m analyzer_wikipedia_search \
        -f 1 -c 50000 -t 3 -x "xml" \
        config/wikipedia.ana tmp/
    rm -Rf tmp/

    strusCreate -S config/storage.conf
    for dd in 00 03 06 09 12 15 18 21 24; do echo "-------- $dd"; scripts/insert.sh data/wikipedia$dd.tar.gz; done
    mv storage storage1

    strusCreate -S config/storage.conf
    for dd in 01 04 07 10 13 16 19 22 25; do echo "-------- $dd"; scripts/insert.sh data/wikipedia$dd.tar.gz; done
    mv storage storage2

    strusCreate -S config/storage.conf
    for dd in 02 05 08 11 14 17 20 23 26; do echo "-------- $dd"; scripts/insert.sh data/wikipedia$dd.tar.gz; done
    mv storage storage3
  3. Watch the strusInsert program at work:
    [Image: the strusInsert program at work]

    The strusInsert program at work. The high CPU usage is periodically interrupted by breaks during the commit phase. For tuning the system to other hardware you have to change the settings in scripts/insert.sh. Without suitable settings the insert performance may degrade heavily. With a growing index the average CPU usage may also drop significantly. I must state here, though, that Strus is able to deal with conventional disks. I used a 32-bit machine with a SATA disk for a long time as a demo to investigate and improve the behaviour of Strus. By now it plays fairly well with decent settings. You just have to expect insertion times of about 24 hours or even beyond on a setup with conventional hard disks.

  4. After the insert we initialize the document weight in the metadata table, calculated from the number of page references to that document. We do not calculate a transitive page rank, but simply use the number of links pointing to the document. From this value we calculate a pageweight between 0.0 and 1.0 (a small illustration of the mapping follows after this list). We have a shell script prepared for that.
    #!/bin/sh

    # This script assumes that the meta data table schema has an element "pageweight Float32" defined

    STORAGEPATH=storage

    # Initialize the link popularity weight in document meta data (element pageweight):
    echo "[2.1] get the link reference statistics"
    truncate -s 0 resources/linkid_list.txt
    for ii in 1 2 3
    do
        strusInspect -s "path=$STORAGEPATH$ii" fwstats linkid >> resources/linkid_list.txt
        echo "[2.2] get the docno -> docid map"
        strusInspect -s "path=$STORAGEPATH$ii" attribute docid | strusAnalyzePhrase -n "lc:text" -q '' - > resources/docid_list$ii.txt
    done
    for ii in 1 2 3
    do
        echo "[2.3] calculate a map docno -> number of references to this page"
        scripts/calcDocidRefs.pl resources/docid_list$ii.txt resources/linkid_list.txt > resources/docnoref_map$ii.txt
        echo "[2.4] calculate a map docno -> link popularity weight"
        scripts/calcWeights.pl resources/docnoref_map$ii.txt 'tanh(x/50)' > resources/pageweight_map$ii.txt
        echo "[2.5] update the meta data table element pageweight with the link popularity weight"
        strusUpdateStorage -s "path=$STORAGEPATH$ii" -m pageweight resources/pageweight_map$ii.txt
    done
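
To get a feeling for the 'tanh(x/50)' formula passed to calcWeights.pl, here is a small illustration. It is not part of the build; it is just a shell sketch evaluating the same formula for a few link counts:

    # Illustration of the pageweight formula tanh(x/50), where x is the number of
    # inbound links. tanh is expressed via exp() because awk has no tanh built-in.
    for refs in 1 10 50 100 500; do
        awk -v x="$refs" 'BEGIN { e = exp(2 * x / 50); printf "%4d links -> pageweight %.3f\n", x, (e - 1) / (e + 1) }'
    done

A document with 50 inbound links ends up around 0.76, and the weight saturates towards 1.0 for very popular pages.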

Starting up the system

The setup of the system is similar to the one described in the CodeProject article Distributing a search engine index with Strus. We have 3 storage servers (each one serving one of the storages built before), one statistics server and one web server. For simplicity we start each of the servers in a screen session. You probably know how to configure them as a service on your system.

  1. Statistics server The statistics server holds the global statistics of the collection. This is the data that makes query results from different storage nodes comparable and thus allows us to split the search index according to our needs.
    screen -dmS statserver client/strusStatisticsServer.py
  2. Storage servers Each storage server node serves queries on one of the storage indexes built before.
    for ii in 1 2 3
    do
        screen -dmS storageserver$ii client/strusStorageServer.py -i $ii -c "path=storage$ii; cache=2G" -p 719$ii -P
    done
  3. Http server Our HTTP server based on Tornado answers the queries coming in as HTTP GET requests and returns the results as rendered HTML pages. A quick smoke test with curl is sketched after this list.
    screen -dmS httpserver client/strusHttpServer.py -p 80 7191 7192 7193
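
Once the three screens are up, you can do a quick smoke test from the command line. The URL scheme below (a start page at the root path and a query parameter named q) is an assumption made for illustration; check client/strusHttpServer.py for the exact paths and parameter names it actually serves:

    # Smoke test of the Tornado front end on port 80.
    # Assumption: the server renders a start page at / and accepts the search term
    # in a 'q' query parameter; verify both against client/strusHttpServer.py.
    curl -s "http://localhost:80/" | head -n 20
    curl -s "http://localhost:80/query?q=space+shuttle" | head -n 40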
Searching

[Image: a search query and its results in the web interface]

We got results; now we can start to improve our weighting scheme.

More about the project

There exists a description of the formal aspects of this project, like the configuration and the query evaluation schemes used. If you want to dig deeper, you'll find some anchors here.

Online

You might not want to build up a Wikipedia search or another search project on your own, but maybe you got curious. There is a NUC out there running a Wikipedia search for you. I am currently hosting it from my flat, so service availability is best effort.
