Računalna obrada teksta

Marohnić, Janko (2015) Računalna obrada teksta. Diploma thesis, Faculty of Science > Department of Mathematics.

Language: Croatian


Abstract

Because of the rapid growth of digital information, advanced methods for searching large amounts of text are required. Traditionally, relational databases handled information search, but they are suited only to structured, discrete queries. Full-text search requires more advanced text processing, so that it can answer the question of how well a given document matches a query. The text must be processed up front so that queries can run as fast as possible. The text is first split into sentences, and each sentence is then tokenized, which turns searching text into searching a list of tokens. Next, all tokens are lowercased and their diacritics are normalized, which simplifies the character set needed for queries. Stopwords (words that carry no additional meaning) are then removed from the token list, and each token is stemmed, which lets a keyword in the query match documents containing any variation of that keyword (e.g. plural or singular). Finally, the processed list of tokens is saved in a structure called the index, which is then used for searching. Once indexing is complete, the index can be queried. The query text is first processed in the same way as the indexed text, with the addition that each token is also expanded with its synonyms to increase the number of relevant documents returned. At query time the query text is analyzed for phrases (sequences of words inside double quotes), Boolean operators, wildcards and regular expressions. Structured search can also be included in the query. While a user is typing, it can be useful to suggest completions of the query in a box below the input. At query execution time it is also good to tolerate possible typos. Finally, since this kind of search usually yields many results, the documents must be ranked by relevance, which can be influenced in many ways.
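The indexing and query-analysis steps described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the stopword list, the suffix-stripping `naive_stem` function and all other names are hypothetical stand-ins, sentence splitting and synonym expansion are omitted, and a real system would use a language-specific stemmer and a far richer index.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical, deliberately tiny English stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def fold_diacritics(text):
    """Drop combining marks after NFKD decomposition (e.g. 'č' becomes 'c').

    Letters with no Unicode decomposition, such as 'đ', pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Toy stemmer: strip one common English suffix, keeping at least 3 letters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, fold diacritics, tokenize, remove stopwords, stem."""
    tokens = re.findall(r"\w+", fold_diacritics(text.lower()))
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Build an inverted index: processed token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every processed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Searching the indexes",
    2: "An index of cafés",
    3: "Relational databases",
}
index = build_index(docs)
matches = search(index, "index")
```

Because the query passes through the same `analyze` function as the documents, the singular query "index" also finds the document that contains only the plural "indexes".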
Besides automatic relevance scoring based on the number of occurrences of the keywords in a document, it is possible to specify that some fields are more important than others (e.g. the title), so that documents matching the query in the title rank higher. It is also useful to rank documents by the order of, and distance between, the query keywords. For a long time, search on web pages was powered by commercial search engines such as Google. However, during the past five years a great number of open-source full-text search engines have emerged. The most popular are Apache Solr, Sphinx, PostgreSQL (which gained support for full-text search) and Elasticsearch. Testing and feature analysis led to the conclusion that Elasticsearch came out ahead. Many people use high-quality search engines such as Google in everyday life, and it becomes the mission of web applications to bring the quality of their own search as close as possible to that standard.
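The idea of weighting some fields more heavily than others when ranking can be sketched as follows. This is an illustration only, assuming plain keyword counts multiplied by a per-field boost; the boost values and the names `FIELD_BOOSTS`, `score` and `rank` are hypothetical, and the sketch does not reproduce the scoring formula of Elasticsearch or any other engine named above.

```python
from collections import Counter

# Hypothetical boosts: a match in the title counts three times as much.
FIELD_BOOSTS = {"title": 3.0, "body": 1.0}


def score(doc, query_terms, boosts=FIELD_BOOSTS):
    """Sum keyword occurrences per field, weighted by that field's boost."""
    total = 0.0
    for field, boost in boosts.items():
        counts = Counter(doc.get(field, "").lower().split())
        total += boost * sum(counts[term] for term in query_terms)
    return total


def rank(docs, query):
    """Return ids of matching documents, highest score first."""
    terms = query.lower().split()
    scored = [(score(doc, terms), doc_id) for doc_id, doc in docs.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [doc_id for s, doc_id in scored if s > 0]


docs = {
    "a": {"title": "full text search", "body": "notes"},
    "b": {"title": "notes", "body": "search search"},
}
ranking = rank(docs, "search")
```

Here document "a" outranks "b" even though "b" mentions the keyword twice, because the single occurrence in "a" is in the boosted title field.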

Item Type: Thesis (Diploma thesis)
Supervisor: Manger, Robert
Date: 2015
Number of Pages: 40
Subjects: NATURAL SCIENCES > Mathematics
Divisions: Faculty of Science > Department of Mathematics
Depositing User: Iva Prah
Date Deposited: 03 Sep 2015 12:38
Last Modified: 03 Sep 2015 12:38
URI: http://digre.pmf.unizg.hr/id/eprint/4221
