Latentna semantička analiza, varijante i primjene (Latent Semantic Analysis, Variants and Applications)

Marasović, Ana (2015) Latentna semantička analiza, varijante i primjene. Diploma thesis, Faculty of Science > Department of Mathematics.

Language: Croatian

Abstract

It is increasingly important to build computers that perform tasks humans do routinely, and that do so just as quickly and efficiently. One such task is finding, in a given collection, the few documents most relevant to a user's query. The first step in solving this problem is to represent the collection as a term-document matrix whose elements are the tf-idf weights of the words in each document. In this way, each document is represented as a vector in the space of terms. If the query is represented as a vector as well, standard similarity measures, such as cosine similarity, can be used to compare the query with the documents. In this space, however, synonyms are represented by orthogonal vectors, and a polysemous word is represented by a single vector regardless of its context.
Motivated by this, and by the large dimension of the term-document matrix, we compute a lower-rank approximation of the matrix using its singular value decomposition (SVD); this method is latent semantic analysis (LSA). We show that the approximation takes the context of words into account. The query must be transformed into the new space as well, so that it can be compared with the document vectors in this lower-dimensional space. We also show how new documents and terms can be added when the collection is dynamic. While this method solves the problem of synonyms to some extent, the problem of polysemy remains unsolved. In addition, LSA assumes that the noise in the data (arising from language variability) has a Gaussian distribution, which is not a natural assumption.
The next method, probabilistic LSA (pLSA), assumes that each document arises from a generative probabilistic process whose parameters are estimated by maximizing the likelihood. Each document is a mixture of latent concepts, and we look for the posterior probabilities of these concepts given the observations. However, pLSA treats these probabilities as parameters of the model, which leads to overfitting. Therefore, we present another model, latent Dirichlet allocation (LDA), which treats these probabilities as following a distribution that depends on a parameter. Documents are again represented as mixtures of latent topics, but each topic is now a distribution over the words of the dictionary. It is therefore necessary to define a distribution over distributions, and a natural choice is the Dirichlet distribution. Finally, we briefly present topic modeling of a collection of Wikipedia articles.
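
The LSA steps described in the abstract (tf-idf term-document matrix, rank-k approximation via SVD, folding the query into the reduced space, cosine similarity) can be illustrated with a minimal sketch. This is not code from the thesis: the toy collection, the unsmoothed idf formula log(N/df), the choice k = 2, and the helper names `tfidf_matrix`, `fold_in`, `cosine`, and `rank` are all assumptions made for illustration. The query is folded in with the commonly used formula q_k = S_k^{-1} U_k^T q.

```python
import numpy as np

# Hypothetical toy collection: each document is a list of tokens.
docs = [
    ["car", "engine", "road"],
    ["automobile", "engine", "wheel"],
    ["car", "automobile", "road"],
    ["flower", "garden", "soil"],
    ["garden", "flower", "tree"],
]
terms = sorted({t for d in docs for t in d})
index = {t: i for i, t in enumerate(terms)}


def tfidf_matrix(docs):
    """Build a terms x documents matrix of tf-idf weights (assumed idf = log(N/df))."""
    tf = np.zeros((len(terms), len(docs)))
    for j, d in enumerate(docs):
        for t in d:
            tf[index[t], j] += 1.0
    df = np.count_nonzero(tf, axis=1)   # number of documents containing each term
    idf = np.log(len(docs) / df)
    return tf * idf[:, None], idf


A, idf = tfidf_matrix(docs)

# Rank-k approximation: keep the k largest singular triplets of A.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
doc_vecs = Vt[:k, :].T                  # one row per document in the latent space


def fold_in(tokens):
    """Map a query into the latent space: q_k = S_k^{-1} U_k^T q."""
    q = np.zeros(len(terms))
    for t in tokens:
        if t in index:                  # terms outside the dictionary are ignored
            q[index[t]] += 1.0
    q *= idf
    return (q @ U_k) / s_k


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank(tokens):
    """Return document indices sorted by decreasing cosine similarity to the query."""
    q_hat = fold_in(tokens)
    scores = [cosine(q_hat, d) for d in doc_vecs]
    return sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)
```

In this toy collection, document 0 does not contain the word "automobile", so its raw tf-idf vector is orthogonal to the query `["automobile"]`. After the rank-2 reduction, `cosine(fold_in(["automobile"]), doc_vecs[0])` should be close to 1, because "car" and "automobile" co-occur with the same terms. This is the sense in which the approximation takes context into account.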

Item Type: Thesis (Diploma thesis)
Supervisor: Singer, Saša
Date: 2015
Number of Pages: 49
Subjects: NATURAL SCIENCES > Mathematics
Divisions: Faculty of Science > Department of Mathematics
Depositing User: Iva Prah
Date Deposited: 22 Oct 2015 09:57
Last Modified: 22 Oct 2015 09:57
URI: http://digre.pmf.unizg.hr/id/eprint/4163
