SAS Text Mining Tutorial by Examples
We use the classic example of the stylometric investigation of the 77 Federalist Papers to explain SAS text mining step by step.
Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to convince the citizens of the state of New York to ratify the new Constitution of the United States. These essays are collectively called The Federalist: A Collection of Essays. Copies of the papers in a variety of formats can be found at avalon.law.yale.edu.
Of the 85 essays, 51 are attributed to Hamilton, 14 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The remaining 12 essays are disputed: each is attributed to either Hamilton or Madison.
After filtering out 7 noisy documents (Filter node), making sure the target variable is binary (Metadata node), setting up the synonyms and stop list (Text Parsing node), and choosing the term weight (Text Filter node), we transform the text to numbers with the following linear algebra approach to quantifying documents:
• For each document, there is a corresponding vector over the ordered terms.
• For each query (a combination of terms), there is also a corresponding vector.
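To make the two bullets above concrete, here is a minimal Python sketch (not SAS code) that turns documents and a query into count vectors over the same ordered term list. The sample sentences, the whitespace tokenizer, and the helper names `build_vocabulary` and `to_vector` are assumptions made for illustration; the Text Parsing node does far more (stemming, synonyms, stop lists).

```python
from collections import Counter

def build_vocabulary(documents):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for doc in documents for term in doc.lower().split()})

def to_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Made-up sample documents, not taken from the Federalist Papers.
documents = [
    "the people of the state",
    "the power of the union",
]

vocabulary = build_vocabulary(documents)
doc_vectors = [to_vector(doc, vocabulary) for doc in documents]

# A query is just another piece of text mapped onto the same ordered terms.
query_vector = to_vector("power union", vocabulary)

print(vocabulary)
print(doc_vectors)
print(query_vector)
```

Because documents and queries share one vocabulary, their vectors have the same length and can be compared position by position.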
[...] node is usually processed before the Text Parsing node. For the choice of the term weight: entropy is the default term weight when no target variable is available.
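As an illustration of what an entropy term weight measures, the sketch below uses the common log-entropy formulation, 1 + sum_j(p_ij * log2(p_ij)) / log2(n), where p_ij is the share of term i's total frequency that falls in document j and n is the number of documents. That this matches the exact computation inside the Text Filter node is an assumption, and the term counts are invented.

```python
import math

def entropy_weight(term_counts):
    """Log-entropy global weight for one term.

    term_counts[j] is the frequency of the term in document j.
    The result is 1.0 when the term occurs in a single document and
    0.0 when it is spread evenly over all documents.
    """
    n_docs = len(term_counts)
    total = sum(term_counts)
    if n_docs < 2 or total == 0:
        raise ValueError("need at least two documents and one occurrence")
    entropy = sum(
        (f / total) * math.log2(f / total) for f in term_counts if f > 0
    )
    return 1.0 + entropy / math.log2(n_docs)

# Invented frequencies of three terms across four documents.
term_frequencies = {
    "upon": [4, 0, 0, 0],   # concentrated in one document
    "the": [5, 5, 5, 5],    # spread evenly, carries little information
    "union": [2, 2, 0, 0],  # somewhere in between
}

weights = {term: entropy_weight(counts) for term, counts in term_frequencies.items()}
print(weights)
```

Terms that occur in only a few documents receive weights near 1, while terms that appear uniformly everywhere are pushed toward 0.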
Open the Filter Viewer from the Text Filter node: you can select a term and search for it to find all the documents that contain that word.
For the Text Cluster node: choosing the SVD dimension is more of an art than a science.
Note that the heuristic employed would derive a smaller number of dimensions if Max SVD Dimensions were set to 100. You can get the same results by setting SVD Resolution to [...] and Max SVD Dimensions to 98. If the node has to be re-run, this will give slightly better performance than setting SVD Resolution to Low and Max SVD Dimensions to 200.
Another common practice is to set the number of clusters based on domain knowledge. For example, you might request 11 clusters, hoping that the clusters will conform to the 11 automotive systems. Unfortunately, if the derived clusters do not correlate well with the known systems, you will probably want to revert to using the cubic clustering criterion.
We can use the inner product of a document vector and a query vector to score that document against the query, and then apply dimension reduction techniques such as SVD (singular value decomposition).
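The scoring step can be sketched in a few lines of Python. The vectors, the document names, and the helper `inner_product` are hypothetical and exist only to show the arithmetic; they are not output from SAS Text Miner.

```python
def inner_product(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

# Hypothetical term-count vectors over the same six ordered terms.
doc_vectors = {
    "doc_a": [1, 1, 0, 1, 2, 0],
    "doc_b": [1, 0, 1, 0, 2, 1],
}
query = [0, 0, 1, 0, 0, 1]

# Score every document against the query, then rank from best to worst.
scores = {name: inner_product(vec, query) for name, vec in doc_vectors.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

print(scores)
print(ranked)
```

A document scores higher when it contains more occurrences of the query terms, so `doc_b` ranks ahead of `doc_a` here.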
• Algorithms process documents (parsing/filtering).
• A derived vector is associated with each document.
• The vector is typically too large and has too many zeros to work with directly, so transformation methods and dimensionality reduction techniques are applied to produce a more useful final vector representation.
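To illustrate the idea of dimensionality reduction, the following pure-Python sketch finds only the first SVD dimension of a small document-by-term matrix using power iteration, and projects each document onto it. This is a simplified stand-in, not the algorithm the SAS nodes run; the matrix and the function names `top_singular_direction` and `project` are assumptions.

```python
import math

def top_singular_direction(matrix, iterations=200):
    """Power iteration on A^T A.

    Returns (sigma, v): an estimate of the largest singular value of the
    matrix and the matching unit-length direction in term space.
    """
    n_cols = len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Compute w = A^T (A v) without forming A^T A explicitly.
        av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        w = [sum(row[j] * s for row, s in zip(matrix, av)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("starting vector is orthogonal to all rows")
        v = [x / norm for x in w]
    av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in av))
    return sigma, v

def project(matrix, v):
    """Coordinate of each row (document) along direction v."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]

# Hypothetical document-by-term counts: 3 documents, 6 terms.
doc_term = [
    [1, 1, 0, 1, 2, 0],
    [1, 0, 1, 0, 2, 1],
    [0, 0, 2, 0, 0, 2],
]

sigma, direction = top_singular_direction(doc_term)
coordinates = project(doc_term, direction)
print(sigma)
print(coordinates)
```

Each document is now summarized by a single number instead of six counts. A full SVD repeats this for further dimensions, giving the short dense vectors used in later steps.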
As the process flow shows, we use two sets of quantitative variables, the SVD dimensions and the topic clusters, as inputs to a logistic regression on the binary target variable.
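The modeling step can be sketched outside SAS as a small logistic regression fitted by gradient descent. Everything here is an assumption made for illustration: the two feature columns stand in for SVD dimensions, the values are invented, label 1 stands for Hamilton and 0 for Madison, and the fitting routine is not what the SAS Regression node does internally.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    n_rows = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            error = sigmoid(z) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_rows for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_rows
    return weights, bias

def predict(x, weights, bias):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if sigmoid(z) >= 0.5 else 0

# Invented SVD-like features for six essays; 1 = Hamilton, 0 = Madison.
features = [
    [0.9, 0.2], [0.8, 0.1], [0.7, 0.3],
    [0.2, 0.8], [0.1, 0.9], [0.3, 0.7],
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(features, labels)
predictions = [predict(x, weights, bias) for x in features]
print(predictions)
```

On real data the features would be the SVD and cluster variables exported by the text mining nodes, and the model would be evaluated on held-out essays instead of the training rows.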
Remember to check the predicted author by comparing the variables target and i_target.
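One way to make that check is a simple cross-tabulation of the two variables. The sketch below assumes that target holds the actual author and i_target the predicted author, and that the scored data has been exported as rows; the four rows shown are invented.

```python
from collections import Counter

# Invented scored rows; real rows would come from the scored data set.
scored = [
    {"target": "HAMILTON", "i_target": "HAMILTON"},
    {"target": "HAMILTON", "i_target": "HAMILTON"},
    {"target": "MADISON", "i_target": "MADISON"},
    {"target": "MADISON", "i_target": "HAMILTON"},
]

# Count each (actual, predicted) pair.
crosstab = Counter((row["target"], row["i_target"]) for row in scored)

# Share of rows where the prediction matches the actual author.
correct = sum(n for (actual, predicted), n in crosstab.items() if actual == predicted)
accuracy = correct / len(scored)

for (actual, predicted), n in sorted(crosstab.items()):
    print(actual, predicted, n)
print(accuracy)
```

The off-diagonal pairs, where the two values differ, are the essays the model misattributes.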