Marasović, Ana (2015) Latentna semantička analiza, varijante i primjene [Latent semantic analysis, variants and applications]. Diploma thesis, Faculty of Science > Department of Mathematics.

Language: Croatian
Abstract
Nowadays it is increasingly important to have computers perform, quickly and efficiently, tasks that people do routinely. One such task is finding the few documents in a given collection that are most relevant to a user's query. The first step in solving this problem is to represent the collection as a term-document matrix whose elements are the tf-idf weights of words in the documents. In this way each document is represented as a vector in the space of terms. If the query is represented as a vector as well, standard similarity measures, such as cosine similarity, can be used to compare the query with the documents. In such a space, however, synonyms are orthogonal, and a polysemous word is represented by a single vector regardless of its context. Motivated by this fact, and by the large dimension of the term-document matrix, a lower-rank approximation of the matrix is computed. The approximation is obtained from the singular value decomposition (SVD) of the matrix. We show that this approximation takes the context of words into account. The query must also be transformed into the new space so that it can be compared with the vectors in this lower-dimensional space. We also show how new documents and terms can be added in the case of a dynamic collection. While this method solves the problem of synonyms to some extent, the problem of polysemy remains unsolved. In addition, LSA assumes that the noise in the data (arising from the variability of language) has a Gaussian distribution, which is not a natural assumption. The next method, pLSA, assumes that each document arises from a generative probabilistic process whose parameters we estimate by maximizing the likelihood. Each document is a mixture of latent concepts, and we seek the posterior probabilities of these concepts given the observations. However, pLSA treats these probabilities as parameters of the model, which leads to overfitting.
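The LSA pipeline described above (tf-idf term-document matrix, rank-k SVD approximation, folding the query into the latent space, cosine-similarity ranking) can be sketched as follows; this is a minimal illustration using scikit-learn, and the toy corpus, query, and rank k = 2 are assumptions for the example, not material from the thesis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy collection (illustrative only).
docs = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock prices fell on the market",
    "the market reacted to interest rates",
]

# Term-document representation with tf-idf weights (rows = documents here).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Rank-k approximation via truncated SVD: documents mapped into the
# k-dimensional latent space.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_latent = svd.fit_transform(X)

# The query must be folded into the same latent space before comparison.
query_latent = svd.transform(vectorizer.transform(["pet cats"]))

# Cosine similarity in the latent space ranks documents by relevance.
sims = cosine_similarity(query_latent, doc_latent)[0]
ranking = sims.argsort()[::-1]
```

With small k, documents that share no terms with the query can still receive a high score when they lie in the same latent direction, which is how the approximation captures context.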
Therefore we present another model, LDA, which treats these probabilities as a distribution depending on a parameter. Documents are again represented as mixtures of latent topics, but the topics themselves are distributions over the words of the dictionary. It is therefore necessary to define a distribution over distributions, and the natural choice is the Dirichlet distribution. Finally, we briefly present topic modeling of a collection of Wikipedia articles.
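The LDA view (each document a mixture over latent topics, each topic a distribution over dictionary words, with Dirichlet priors on both) can be sketched with scikit-learn's variational LDA implementation; the toy corpus and the choice of two topics are assumptions for the example only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy collection (illustrative only).
docs = [
    "cats and dogs are popular pets",
    "pets need food and daily care",
    "stock prices rose on the market",
    "the market fell after the report",
]

# Unlike LSA, LDA models raw word counts, not tf-idf weights.
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Each document becomes a mixture over n_components latent topics;
# each fitted topic (lda.components_) is a distribution over words.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Rows of doc_topics are per-document topic proportions and sum to 1.
```

Inspecting the largest entries of each row of `lda.components_` recovers the most probable dictionary words per topic, which is how the fitted topics are usually interpreted.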
Item Type:  Thesis (Diploma thesis) 

Supervisor:  Singer, Saša 
Date:  2015 
Number of Pages:  49 
Subjects:  NATURAL SCIENCES > Mathematics 
Divisions:  Faculty of Science > Department of Mathematics 
Depositing User:  Iva Prah 
Date Deposited:  22 Oct 2015 09:57 
Last Modified:  22 Oct 2015 09:57 
URI:  http://digre.pmf.unizg.hr/id/eprint/4163 