June 8, 2017

Python and R Resources for text-mining

This is just a growing collection of useful Python and R packages and resources for text-mining. It's the flavor of ML I'm working on the most, so this is mostly just for my own hassle-avoidance.

Work in progress.

Python

Libraries

  • NLTK — the 1000-pound gorilla here.
  • Gensim — very frequently used topic modeling library; it also has a word2vec implementation that seems to get a lot of usage.
  • Textmining – a very small library that creates document-term matrices (DTMs) and the like. A nice alternative to wading through tons of NLTK documentation.
  • spaCy — an NLTK alternative; I haven't tried it.
  • TextBlob — I haven't used it, but it looks to cover a large subset of common text-processing tasks with a relatively rational-looking API.
  • BeautifulSoup — the standard Python library for getting your texts out of web pages (agonizingly complicated API, but, then again, that's probably the DOM's fault). Incidentally, you should really be using requests to actually send HTTP requests for scraping.
  • Scrapy — another important Python web-scraping tool, though I've never actually used it.
  • wordcloud — word clouds are always fun. Kinda useless, but fun.
  • Pattern — a library that combines web-scraping with some standard NLP tools.
  • CoreNLP Wrappers – Python wrappers for Stanford CoreNLP.
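
Since DTMs come up in the list above: a document-term matrix is just per-document word counts laid out over a shared vocabulary. Here's a rough sketch in plain Python, standard library only. To be clear, `build_dtm` is a name I made up for illustration and is not the API of Textmining or any other library listed here, and the regex tokenizer is a deliberately naive assumption (real work would want proper tokenization and stopword handling).

```python
import re
from collections import Counter


def build_dtm(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term vocab[j] in docs[i].

    Hypothetical helper for illustration only; not any library's API.
    """
    # Naive tokenizer (an assumption): lowercase, then runs of letters/apostrophes.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in docs]
    # Shared vocabulary across all documents, sorted for stable column order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # One row per document, one column per vocabulary term.
    counts = [Counter(toks) for toks in tokenized]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


docs = ["The cat sat on the mat.", "The dog chased the cat."]
vocab, dtm = build_dtm(docs)
print(vocab)
for row in dtm:
    print(row)
```

For anything bigger than a toy corpus you'd want a sparse representation rather than nested lists, which is part of what the libraries above are for.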

Publications, Tutorials, etc.

(to be added)

R

Libraries

  • TM — The classic package, but I hate it like sin. On the other hand, its agonizingly horrible API is the source of my most-upvoted Stack Overflow answer, so, thanks? This package and strings-as-factors together drove my abandonment of R for Python.
  • Quanteda — Ken Benoit's TM alternative.
  • tidytext — I haven't used this yet, but given the authors (Robinson and Silge), it's probably amazing.
  • ToPan — a cool, batteries-included Shiny app designed specifically for topic modeling in ancient languages.
  • LDA
  • Structural Topic Model – a package by Molly Roberts, Brandon Stewart, and Dustin Tingley, widely used among political scientists.
  • Topicmodels
  • wordcloud
  • WordVectors — word2vec implementation.

Publications, Tutorials, etc.

(to be added)

Tags: machine-learning text-mining datascience r python