
Project 4

Latent Semantic Analysis via Singular Value Decomposition

 

 

By Michael Chungkun Chen

SEASID: mchen

 

Computer Science 161 Artificial Intelligence

Monday, December 3, 2001

 

 

Introduction

 

In short, the goal of this project was to experiment with the Latent Semantic Analysis (LSA) model and examine the vector representations of words.

 

The language used to implement this project is C++.  This language was chosen partly for its advanced data-structure capabilities and object-oriented approach, and partly because I wanted to brush up on classes and the STL.

 

 

Process

 

Before any work could be done with the Latent Semantic Analysis model, we needed a large corpus of text represented as a word-paragraph matrix.  I used a sparse matrix representation while gathering data from the files, which required a way to store the nonzero elements.  I decided on a linked list of my own datatypes, using the list container that the STL provides as standard with C++ compilers.
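
A minimal sketch of what that sparse representation might look like (the struct and field names here are illustrative, not the exact types from my program):

    #include <list>
    #include <string>

    // One nonzero cell of the word-paragraph matrix.
    struct MatrixEntry {
        std::string word;   // row label
        int paragraph;      // column index
        int count;          // occurrences of the word in that paragraph
    };

    // The sparse matrix is simply a list of nonzero entries.
    typedef std::list<MatrixEntry> SparseMatrix;

    // Record one occurrence, either bumping an existing entry or appending a new one.
    void addOccurrence(SparseMatrix &m, const std::string &word, int paragraph)
    {
        for (SparseMatrix::iterator it = m.begin(); it != m.end(); ++it) {
            if (it->word == word && it->paragraph == paragraph) {
                ++it->count;
                return;
            }
        }
        MatrixEntry e;
        e.word = word;
        e.paragraph = paragraph;
        e.count = 1;
        m.push_back(e);
    }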

 

Another necessity was file parsing and stopword elimination.  To handle this more intuitively, I created a file-tokenizing class that strips stopwords, detects ends of paragraphs (blank space between bodies of text), and enforces word-length and paragraph-count limits.
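
Roughly, the tokenizing step behaves like the following sketch (a simplified, free-function stand-in for the class, with hypothetical names):

    #include <cctype>
    #include <fstream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    // Split a file into paragraphs of lower-cased words, dropping stopwords and
    // over-long words, and stopping after maxParagraphs paragraphs.
    std::vector< std::vector<std::string> >
    tokenize(const std::string &path, const std::set<std::string> &stopwords,
             size_t maxWordLen, size_t maxParagraphs)
    {
        std::vector< std::vector<std::string> > paragraphs(1);
        std::ifstream in(path.c_str());
        std::string line;
        while (std::getline(in, line) && paragraphs.size() <= maxParagraphs) {
            if (line.find_first_not_of(" \t\r") == std::string::npos) {
                // Blank line: close the current paragraph if it has any words.
                if (!paragraphs.back().empty())
                    paragraphs.push_back(std::vector<std::string>());
                continue;
            }
            std::istringstream words(line);
            std::string w;
            while (words >> w) {
                // Keep only alphabetic characters, lower-cased.
                std::string clean;
                for (size_t i = 0; i < w.size(); ++i)
                    if (std::isalpha(static_cast<unsigned char>(w[i])))
                        clean += static_cast<char>(std::tolower(static_cast<unsigned char>(w[i])));
                if (clean.empty() || clean.size() > maxWordLen) continue;
                if (stopwords.count(clean)) continue;
                paragraphs.back().push_back(clean);
            }
        }
        if (paragraphs.back().empty()) paragraphs.pop_back();
        return paragraphs;
    }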

 

Once I had created a linked list of all the words in each paragraph, I collected identical words across the whole text.  This was used to find every word that appeared only once; those words were eliminated by adding them to the bad-word list in the file tokenizer.  The result is a two-pass system: the second pass through the file collects the final word-paragraph sparse matrix.  Throughout this process I aimed to keep every step on the order of n^2 or less, since we are dealing with large enough quantities of data that running time is a real concern.
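
The first pass can be summarized roughly as follows (a sketch that uses a std::map in place of my linked-list merge; the names are illustrative):

    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // First pass: count how often each word appears in the whole corpus and
    // return the words that occur only once, so they can be added to the
    // tokenizer's bad-word list before the second pass.
    std::set<std::string>
    findSingletons(const std::vector< std::vector<std::string> > &paragraphs)
    {
        std::map<std::string, int> frequency;
        for (size_t p = 0; p < paragraphs.size(); ++p)
            for (size_t w = 0; w < paragraphs[p].size(); ++w)
                ++frequency[paragraphs[p][w]];

        std::set<std::string> singletons;
        for (std::map<std::string, int>::const_iterator it = frequency.begin();
             it != frequency.end(); ++it)
            if (it->second == 1)
                singletons.insert(it->first);
        return singletons;
    }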

 

I also needed a word list to convert each word to its corresponding row index.  Taking the result of collecting the identical words, the program sorts it alphabetically and places the words in an array of strings.  This allows us to look up the index of a word with a binary search, in time on the order of log n.
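
A sketch of that lookup, assuming the sorted word list is kept in a std::vector<std::string>:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Binary search the alphabetically sorted word list; returns the row index
    // of the word, or -1 if it is not present.
    int wordIndex(const std::vector<std::string> &sortedWords, const std::string &word)
    {
        std::vector<std::string>::const_iterator it =
            std::lower_bound(sortedWords.begin(), sortedWords.end(), word);
        if (it != sortedWords.end() && *it == word)
            return static_cast<int>(it - sortedWords.begin());
        return -1;
    }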

 

Once all this preprocessing is done, the matrix needs to be decomposed.  To do this I use the las2 SVD program provided for us.  To use it, I had to write the sparse matrix out to a file in the Harwell-Boeing format.  The SVD program was converted into a function, which I call from my main function.  This decomposition takes a great deal more time than the previous steps.
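
The core of that export is turning the entry list into compressed column storage (column pointers, row indices, values), which is what the Harwell-Boeing format carries; the format's fixed-width header lines and its 1-based Fortran indexing are left out of this sketch:

    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    // From the earlier sketches.
    struct MatrixEntry { std::string word; int paragraph; int count; };
    int wordIndex(const std::vector<std::string> &sortedWords, const std::string &word);

    // Convert the sparse entry list into compressed column storage: colPtr has
    // one entry per column plus a final sentinel, while rowInd and values hold
    // the nonzeros column by column.
    void toCompressedColumns(const std::list<MatrixEntry> &entries,
                             const std::vector<std::string> &sortedWords,
                             int numParagraphs,
                             std::vector<int> &colPtr,
                             std::vector<int> &rowInd,
                             std::vector<double> &values)
    {
        // Bucket the nonzeros by paragraph (column).
        std::vector< std::vector< std::pair<int, double> > > cols(numParagraphs);
        for (std::list<MatrixEntry>::const_iterator it = entries.begin();
             it != entries.end(); ++it) {
            int row = wordIndex(sortedWords, it->word);
            if (row >= 0 && it->paragraph < numParagraphs)
                cols[it->paragraph].push_back(std::make_pair(row, double(it->count)));
        }

        // Flatten the buckets into the three CCS arrays.
        colPtr.assign(1, 0);
        rowInd.clear();
        values.clear();
        for (int c = 0; c < numParagraphs; ++c) {
            for (size_t k = 0; k < cols[c].size(); ++k) {
                rowInd.push_back(cols[c][k].first);
                values.push_back(cols[c][k].second);
            }
            colPtr.push_back(static_cast<int>(rowInd.size()));
        }
    }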

 

After the decomposition, the results are stored in the files singular_values and lav2.  My program then reads the two files and stores the W, S, and P matrices in memory.

 

With this information we can now do the analysis on the words and corpus of text.

 

 

Analysis

 

Unfortunately, I have not been able to process the whole corpus without either a stack overflow or an inordinate amount of running time.  All of the analysis here was therefore done using at most 1,000 paragraphs of the text.

 

The first thing we notice is that the five most common words are "pierre, prince, anna, princess, count", in order of decreasing frequency.  Even from this little piece of information we can surmise a little: the text is about a prince in some way, Pierre is the name of someone mentioned often, Anna is also mentioned often, and we also read about princesses and counts.

 

The most frequently used words alone give us some information, but not a lot.  This is why we go further and analyze the cosine distance.  The following is an excerpt from the capture of a program run:

 

Part III Most similar words by cosine distance

[princess] closest matched to [anna] with a cosine distance: 0.27684

[natasha] closest matched to [sonya] with a cosine distance: 0.467208

[army] closest matched to [talks] with a cosine distance: 0.516742

[young] closest matched to [nicholas] with a cosine distance: 0.272878

[suddenly] closest matched to [collar,sideways] with a cosine distance: 0.311138

 

When we examine this information we can make some guesses.  Princess is closest matched to anna, perhaps because the princess's name is Anna.  Natasha and sonya matched each other probably because they refer to the same person, or because the two are always together.  Army and talks is an unusual match; it may refer to talks that generals give to the army, or to talk of an army coming to town, and surprisingly this pair gives us the greatest cosine distance, 0.517.  Young being matched with nicholas suggests a relationship there; perhaps Nicholas is a young person and is referred to as young Nicholas most of the time.  Suddenly is matched to collar and sideways with equal cosine distance; the significance of those matches is lost on me.
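
For reference, the cosine distance used above is just the normalized dot product of two word rows; a minimal sketch, assuming each word's row has been expanded into a dense vector of paragraph counts:

    #include <cmath>
    #include <vector>

    // Cosine distance between two word vectors: dot(a, b) / (|a| * |b|).
    // Returns 0 if either vector is all zeros.
    double cosineDistance(const std::vector<double> &a, const std::vector<double> &b)
    {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (std::sqrt(normA) * std::sqrt(normB));
    }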

 

With the cosine distance of the sparse word-paragraph matrix, we can only surmise similarity in the usage of the words.  To get at more fundamental characteristics, we look at the cosine distance in the reduced matrix, first with a K value of 50 and later with a K value of 221 (increasing K to 300 took too long).  The following is the LSA result for K=50:

 

[princess] matched [prince] Reduced Cosine Distance = 0.700816

[natasha] matched [sonya] Reduced Cosine Distance = 0.745942

[army] matched [join] Reduced Cosine Distance = 0.877697

[young] matched [little] Reduced Cosine Distance = 0.576093

[suddenly] matched [onto] Reduced Cosine Distance = 0.670823

 

Now we see more fundamental relationships.  Princess is the female counterpart of prince, which gives this match real meaning.  Natasha once again matches Sonya, which strengthens my guess that the person's full name is Sonya Natasha or that they play similar roles.  We also see a fundamental relationship between army and join: for example, you "join the army", or the army requires people to join.  The word young is matched with the word little, which makes sense, since when one is young, one is often little and has little experience.  Suddenly matches onto here, suggesting that phrases like "jumps suddenly onto" or similar ones occur often in the text.

 

We notice that the reduced cosine distances are all much greater than the normal cosine distances.  This means that the words are closer to one another in the K-dimensional representation than they are in the original dimensionality.
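
The reduced distances are computed the same way, but on each word's K-dimensional coordinates.  One common choice, and the one sketched here (my program may differ in detail), is to scale row i of the W matrix by the K singular values and compare those scaled rows:

    #include <vector>

    // Reduced representation of word i: row i of W (numWords x K) scaled by the
    // K singular values S.  Feeding two such rows to cosineDistance from the
    // earlier sketch gives the reduced cosine distance.
    std::vector<double> reducedWordVector(const std::vector< std::vector<double> > &W,
                                          const std::vector<double> &S,
                                          size_t wordRow)
    {
        std::vector<double> v(S.size(), 0.0);
        for (size_t k = 0; k < S.size(); ++k)
            v[k] = W[wordRow][k] * S[k];
        return v;
    }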

 

We also look at the reduced cosine distances with another K dimensionality.  The dimensionality here is 221; it was supposed to be 300, but unfortunately the SVD function could only obtain 221 singular values before one of its boundaries was reached.  The following is a capture of the output:

 

[princess] matched [prince] Reduced Cosine Distance = 0.419555

[natasha] matched [sonya] Reduced Cosine Distance = 0.647572

[army] matched [join] Reduced Cosine Distance = 0.723891

[young] matched [nicholas] Reduced Cosine Distance = 0.388559

[suddenly] matched [face] Reduced Cosine Distance = 0.444461

 

We can see that the first three matches are the same.  Young matched back to nicholas, as it did before the dimensionality was reduced, and suddenly matched face, which is another mysterious match.  We also notice that the cosine distances are smaller here; it would appear that the more dimensions we work with, the smaller the reduced cosine distances become.

 

We also try to determine the least similar pair of words.  To do this we had to compare every word with every other word and keep the smallest cosine distance found (a sketch of the search appears after the results).  The following pairs were found in our example.

 

k=50

[latest] matched [object] With Cosine Distance of -0.466285

k=221

[agitated] matched [winking] With Cosine Distance of -0.250617

 

This suggests that latest has nothing to do with object in this dimensionality, since the cosine distance is negative.  This part of the problem took the greatest amount of processing, and it could not be made more efficient in the time allotted.
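
The search itself is a plain O(n^2) sweep over all word pairs, roughly as follows (reusing cosineDistance from the earlier sketch):

    #include <utility>
    #include <vector>

    // From the earlier sketch.
    double cosineDistance(const std::vector<double> &a, const std::vector<double> &b);

    // Scan every pair of word vectors and return the indices of the pair with
    // the smallest cosine distance (the least similar words).
    std::pair<size_t, size_t>
    leastSimilarPair(const std::vector< std::vector<double> > &wordVectors)
    {
        std::pair<size_t, size_t> best(0, 0);
        double smallest = 2.0;  // cosine distance never exceeds 1
        for (size_t i = 0; i < wordVectors.size(); ++i) {
            for (size_t j = i + 1; j < wordVectors.size(); ++j) {
                double d = cosineDistance(wordVectors[i], wordVectors[j]);
                if (d < smallest) {
                    smallest = d;
                    best = std::make_pair(i, j);
                }
            }
        }
        return best;
    }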

 

 


Conclusion

 

Latent Semantic Analysis makes many applications feasible.  For example, we could summarize each paragraph by printing out the 5 or 10 highest-valued words in that paragraph's column; the values indicate how likely each word is to appear in the paragraph, so, stopwords aside, these words should indicate what the paragraph is about.  Since a bag of words might not be enough for a human reader to understand, we could take it one step further and display only the sentences in the paragraph that contain those words, hiding the sentences that do not.
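
Picking those top words out of a paragraph's column would be straightforward; a sketch with a hypothetical helper, not something my program currently does:

    #include <algorithm>
    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    // Return the k words with the largest values in one paragraph's column of
    // the (raw or reconstructed) word-paragraph matrix.
    std::vector<std::string>
    topWordsForParagraph(const std::vector<double> &column,
                         const std::vector<std::string> &sortedWords,
                         size_t k)
    {
        std::vector< std::pair<double, std::string> > scored;
        for (size_t i = 0; i < column.size() && i < sortedWords.size(); ++i)
            scored.push_back(std::make_pair(column[i], sortedWords[i]));

        // Partially sort so the k largest values come first.
        if (k > scored.size()) k = scored.size();
        std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                          std::greater< std::pair<double, std::string> >());

        std::vector<std::string> words;
        for (size_t i = 0; i < k; ++i)
            words.push_back(scored[i].second);
        return words;
    }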

 

Another application of the LSA model is conversational artificial-intelligence memory.  We could take an AI chat engine and add an LSA-reduced matrix to serve as its associative memory, since many papers have shown that the LSA model scores similarly to humans on certain standardized tests.  A neural net could serve as temporary memory that determines the AI's train of thought, with built-in grammatical and syntactical rules, expressed as formal languages, to form coherent sentences.  However, the LSA-reduced matrix is hard to update, since you need to re-decompose and recompose it from the original frequency data.  This poses a problem, because ideally we would treat every new sentence typed to the AI as part of its corpus and recompute a new LSA-reduced matrix; that would give the AI an authentic way to learn new topics, reinforce old data, and correct or clarify ambiguous word relationships.  Because of the recomposition problem, the HAL model, with its continually updated knowledge base, might prove to be a better fit for this type of application.

 

In conclusion, the Latent Semantic Analysis model does have merit and seems capable of deriving meaning; however, the complexity of the matrix processing and the huge amount of text that must be processed cause both time and space issues.  Given more time, both issues could be overcome, but for now our exploration of LSA is limited.

 

