Abstract |
---|
We show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents. The values of the latent variables in the deepest layer are easy to infer and give a much better representation of each document than Latent Semantic Analysis. When the deepest layer is forced to use a small number of binary variables (e.g. 32), the graphical model performs "semantic hashing": Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by simply accessing all the addresses that differ by only a few bits from the address of the query document. This way of extending the efficiency of hash-coding to approximate matching is much faster than locality-sensitive hashing, which is the fastest current method. By using semantic hashing to filter the documents given to TF-IDF, we achieve higher accuracy than applying TF-IDF to the entire document set. |
Field | Value
---|---
Year | 2009
DOI | 10.1016/j.ijar.2008.11.006
Venue | Int. J. Approx. Reasoning
Keywords | Information retrieval, Approximate matching, Graphical models, Unsupervised learning, Latent Semantic Analysis, Deep graphical model
DocType | Journal
Volume | 50
Issue | 7
ISSN | 0888-613X
Citations | 248
PageRank | 17.09
References | 18
Authors | 2
Name | Order | Citations | PageRank
---|---|---|---
Ruslan Salakhutdinov | 1 | 12190 | 764.15
Geoffrey E. Hinton | 2 | 40435 | 4751.69