SAS Text Mining Tutorial by Examples

Background topics:
• What is Data Mining? What is the SEMMA process? Pattern-discovery applications?
• What is Text Mining? Descriptive mining and predictive mining?
• What is SAS Text Miner? Text Parsing, Text Filter?
• What is Sentiment Analysis? Text-mining application areas?

• From Text to Numbers: how to transform non-quantitative text into quantitative data

We use the classic example of a stylometric investigation of the Federalist Papers to explain SAS text mining step by step.

Alexander Hamilton, James Madison, and John Jay wrote a series of essays in 1787 and 1788 to persuade the citizens of the state of New York to ratify the new Constitution of the United States. These essays are collectively called The Federalist: A Collection of Essays. Copies of the papers in a variety of formats can be found at avalon.law.yale.edu, or you can download a copy here.

Of the 85 essays, 51 are attributed to Hamilton, 14 to Madison, 5 to Jay, and 3 to Hamilton and Madison jointly. The 12 remaining essays are disputed: each can be attributed to either Hamilton or Madison.

After we filter out 7 noisy documents (Filter node), ensure the target variable is binary (Metadata node), set up the synonyms and stop list (Text Parsing node), and choose a term weight (Text Filter node), we transform the text into numbers using the following linear-algebra approach to quantify documents:
• For each document, there is a corresponding vector over the ordered terms.
• For each query (a combination of terms), there is also a corresponding vector.
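The vector-space idea above can be sketched in a few lines of Python. This is an illustration, not SAS Text Miner code; the toy corpus and the raw term-frequency weighting are assumptions standing in for the parsed Federalist essays and their weighted term counts.

```python
# Minimal sketch of the vector-space model: each document and each query
# becomes a vector over the ordered vocabulary of terms.
docs = [
    "the powers of the national government",
    "the executive department of the government",
    "taxation and the national revenue",
]
query = "national government"

# Build the ordered term list (vocabulary) from the corpus.
terms = sorted({t for d in docs for t in d.split()})

def to_vector(text):
    """Raw term-frequency vector over the ordered terms."""
    words = text.split()
    return [words.count(t) for t in terms]

doc_vectors = [to_vector(d) for d in docs]
query_vector = to_vector(query)
```

In practice each raw count would be multiplied by a term weight (see the Text Filter discussion below), but the document-as-vector structure is the same.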

The Text Parsing node is usually run before the Text Filter node. For the choice of term weight: entropy is the default term weight when no target variable is available.
Open the Filter Viewer from the Text Filter node: you can select a term to search for, to find all documents containing that word.
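To make the entropy weight concrete, here is a hedged sketch of one common formulation of it (I am not claiming this matches SAS Text Miner's internals exactly): a term spread evenly across all documents gets a weight near 0, while a term concentrated in a single document gets a weight of 1.

```python
import math

def entropy_weight(doc_freqs):
    """One common entropy term weight.

    doc_freqs: raw counts of a single term in each of the n documents.
    Returns a value in [0, 1]; evenly spread terms score near 0.
    """
    n = len(doc_freqs)
    gf = sum(doc_freqs)                      # global frequency of the term
    p = [f / gf for f in doc_freqs if f > 0]
    # 1 plus the (negative) normalized entropy of the term's distribution.
    return 1 + sum(pi * math.log2(pi) for pi in p) / math.log2(n)

# A term appearing once in every one of 4 documents: weight = 0.
print(round(entropy_weight([1, 1, 1, 1]), 4))
# A term appearing in only one document: weight = 1.
print(round(entropy_weight([8, 0, 0, 0]), 4))
```

This is why entropy weighting down-weights common function words and emphasizes terms that discriminate between documents.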

For the Text Cluster node: choosing the SVD dimension is more of an art than a science. Note that the heuristic employed would derive a smaller number of dimensions if Max SVD Dimensions were set to 100. You can get the same results by setting SVD Resolution to High and Max SVD Dimensions to 98. If the node has to be re-run, this gives slightly better performance than setting SVD Resolution to Low and Max SVD Dimensions to 200.
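What the truncated SVD actually does can be illustrated with NumPy (again, an illustration rather than the node's internal implementation; the matrix sizes and random data are placeholders). Keeping only the top k singular triplets projects each document onto k numeric scores:

```python
import numpy as np

# Toy term-by-document matrix: 50 terms (rows) x 12 documents (columns).
rng = np.random.default_rng(0)
A = rng.random((50, 12))
k = 4                                       # max SVD dimensions to keep

U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Each document (column of A) is reduced to k SVD scores.
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T

print(doc_coords.shape)                     # 12 documents, 4 scores each
```

These k scores per document are the "SVD" variables that downstream nodes (clustering, regression) consume.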

Another common practice is to set the cluster number based on domain knowledge. For example, you might request 11 clusters hoping that they conform to the 11 automotive systems. Unfortunately, if the derived clusters do not correlate well with the known systems, then you will probably want to fall back to using the cubic clustering criterion.

We can use the inner product of a document vector and a query vector to score that document against a specific query; we then apply dimensionality-reduction techniques such as SVD (Singular Value Decomposition).
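The inner-product scoring is simple enough to show directly. The vocabulary and vectors below are toy assumptions; in practice the entries would be the weighted term counts produced by the earlier nodes.

```python
# Score a document against a query by the inner product of their vectors.
def score(doc_vec, query_vec):
    return sum(d * q for d, q in zip(doc_vec, query_vec))

# Assumed vocabulary order: ["government", "national", "revenue", "taxation"]
doc_a = [2, 1, 0, 0]
doc_b = [0, 0, 3, 1]
query = [1, 1, 0, 0]        # query: "national government"

print(score(doc_a, query))  # doc_a shares terms with the query
print(score(doc_b, query))  # doc_b shares none, so it scores 0
```

A higher score means the document shares more (weighted) terms with the query.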

The algorithm processes documents (parsing and filtering):
• A derived vector is associated with each document.
• This vector is typically too large and too sparse (too many zeros) to work with directly, so transformation and dimensionality-reduction techniques are applied to produce a more useful final vector representation for each document.

As shown in the process flow, we use those two sets of quantitative variables, the SVD dimensions and the topic clusters, to run a logistic regression on the binary target variable. Remember to check the predicted author using the variables target and i_target.
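The final modeling step can be sketched as follows. This is a NumPy stand-in for the SAS Regression node, with synthetic data playing the role of the real SVD document scores and a plain gradient-descent fit instead of SAS's estimation routine:

```python
import numpy as np

# Synthetic stand-in: 60 documents, each described by 3 SVD scores,
# with a binary author target generated from a hidden linear rule.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w + 0.3 * rng.normal(size=60) > 0).astype(float)

# Fit logistic regression by plain gradient descent.
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))          # predicted P(class = 1)
    w -= 0.1 * X.T @ (p - y) / len(y)       # gradient step on log-loss

pred = (1 / (1 + np.exp(-(X @ w))) > 0.5).astype(float)
print((pred == y).mean())                   # training accuracy
```

In the Federalist application, the predicted class for each disputed essay is the model's guess at its author (Hamilton or Madison).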