Research

    Compressing Word Embeddings

    Abstract

    Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using large-scale unlabelled text analysis. However, these representations typically consist of dense vectors that require a great deal of storage and cause the internal structure of the vector space to be opaque. A more idealized representation of a vocabulary would be both compact and readily interpretable. With this goal, this paper first shows that Lloyd's algorithm can compress the standard dense vector representation by a factor of 10 without much loss in performance. Then, using that compressed size as a storage budget, we describe a new GPU-friendly factorization procedure to obtain a representation which gains interpretability as a side-effect of being sparse and non-negative in each encoding dimension. Word similarity and word-analogy tests are used to demonstrate the effectiveness of the compressed representations obtained.
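    The compression step above can be illustrated with a minimal sketch: Lloyd's algorithm (scalar k-means) fits a small codebook to the embedding values, so each 32-bit float can be replaced by a short code index. This is a toy illustration under simplified assumptions (random embeddings, a single shared 1-D codebook), not the paper's exact procedure.

    ```python
    import numpy as np

    def lloyd_quantize(values, n_levels=8, n_iter=20):
        """1-D Lloyd's algorithm: fit n_levels centroids to the values,
        then map each value to the index of its nearest centroid."""
        # Initialise centroids at evenly spaced quantiles of the data.
        centroids = np.quantile(values, np.linspace(0.0, 1.0, n_levels))
        for _ in range(n_iter):
            # Assignment step: nearest centroid for every value.
            codes = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
            # Update step: move each centroid to the mean of its members.
            for j in range(n_levels):
                members = values[codes == j]
                if members.size:
                    centroids[j] = members.mean()
        codes = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        return codes.astype(np.uint8), centroids

    rng = np.random.default_rng(0)
    embeddings = rng.standard_normal((1000, 50)).astype(np.float32)  # toy "vocabulary"
    codes, centroids = lloyd_quantize(embeddings.ravel())
    reconstructed = centroids[codes].reshape(embeddings.shape)
    # 8 levels need only 3 bits per value versus 32-bit floats,
    # i.e. roughly the 10x storage reduction the abstract describes.
    ```

    Storing the `uint8` codes (or packing them to 3 bits) plus the tiny codebook is what yields the compression; the reconstruction `centroids[codes]` is used whenever a dense vector is needed.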

    Links

    Citation

    @Inbook{Andrews2016-CompressingWordEmbeddings,
      author="Andrews, Martin",
      editor="Hirose, Akira and Ozawa, Seiichi and Doya, Kenji 
        and Ikeda, Kazushi and Lee, Minho and Liu, Derong",
      title="Compressing Word Embeddings",
      bookTitle="Neural Information Processing: 
        23rd International Conference, ICONIP 2016, Kyoto, Japan, 
        October 16--21, 2016, Proceedings, Part IV",
      year="2016",
      publisher="Springer International Publishing",
      address="Cham",
      pages="413--422",
      isbn="978-3-319-46681-1",
      doi="10.1007/978-3-319-46681-1_50",
      url="http://dx.doi.org/10.1007/978-3-319-46681-1_50"
    }

    Named Entity Recognition - Training from Experts

    Abstract

    Named Entity Recognition (NER) is a foundational technology for systems designed to process Natural Language documents. However, many existing state-of-the-art systems are difficult to integrate into commercial settings (due to their monolithic construction, licensing constraints, or need for corpora, for example). In this work, a new NER system is described that uses the output of existing systems over large corpora as its training set, ultimately enabling labelling with (i) better F1 scores; (ii) higher labelling speeds; and (iii) no further dependence on the external software.
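    The train-from-experts idea can be sketched in miniature: run an existing "expert" tagger over a corpus, keep its outputs as silver-standard labels, and fit a standalone student model to them. Here both the `expert_tag` stand-in and the majority-vote student are hypothetical simplifications for illustration, not the models used in the paper.

    ```python
    from collections import Counter, defaultdict

    # Hypothetical stand-in for an existing expert NER system.
    def expert_tag(token):
        return "PER" if token in {"Martin", "Andrews"} else "O"

    # A (toy) large unlabelled corpus the expert is run over.
    corpus = ["Martin", "Andrews", "wrote", "the", "paper"] * 100

    # Harvest the expert's outputs as silver-standard training data.
    silver = [(tok, expert_tag(tok)) for tok in corpus]

    # Train the simplest possible student: a per-token majority-label model.
    counts = defaultdict(Counter)
    for tok, tag in silver:
        counts[tok][tag] += 1
    student = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

    # The student now labels tokens with no further calls to the expert.
    predictions = [student.get(tok, "O") for tok in ["Martin", "wrote"]]
    ```

    Once trained, the student replaces the external software entirely, which is what removes the licensing and integration constraints mentioned above.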

    Links

    • 'Lite' version - Poster Prize winner at Nvidia ASEAN GPU conference

    • Full Paper - Presented at IES-2015 in Bangkok, Thailand

    Citation

    @Inbook{Andrews2016-NERfromExperts,
      author="Andrews, Martin",
      editor="Lavangnananda, Kittichai and Phon-Amnuaisuk, Somnuk
        and Engchuan, Worrawat and Chan, Jonathan H.",
      title="Named Entity Recognition Through Learning from Experts",
      bookTitle="Intelligent and Evolutionary Systems: 
        The 19th Asia Pacific Symposium, IES 2015, Bangkok, Thailand, 
        November 2015, Proceedings",
      year="2016",
      publisher="Springer International Publishing",
      address="Cham",
      pages="281--292",
      isbn="978-3-319-27000-5",
      doi="10.1007/978-3-319-27000-5_23",
      url="http://dx.doi.org/10.1007/978-3-319-27000-5_23"
    }