Text classification-php-v4
Uploaded by glenn-de-backer
![Page 1: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/1.jpg)
Text classification in PHP
![Page 2: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/2.jpg)
Who am I?
Glenn De Backer (Twitter: @glenndebacker)
Web developer @ Dx-Solutions
32 years old, originally from Bruges, now living in Meulebeke
Interested in machine learning, (board) games and electronics, and I have a bit of a creative bone…
Blog: http://www.simplicity.be
![Page 3: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/3.jpg)
What will we cover today?
What is text classification
NLP terminology
Bayes theorem
Some PHP code
![Page 4: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/4.jpg)
What is text classification?
Text classification is the process of assigning classes to documents
This can be done manually or algorithmically, by using machine learning
Today's talk will be about classifying text using a supervised machine learning algorithm: Naive Bayes
![Page 5: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/5.jpg)
Supervised vs unsupervised machine learning?
Supervised means, in simple terms, that we need to feed our algorithm examples of data together with what they represent:
Free gift card -> spam
The server is down -> ham
Unsupervised means that we work with algorithms that find hidden structure in unlabelled data, for example clustering documents
![Page 6: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/6.jpg)
Some possible use cases
Spam detection (classic)
Assigning categories, topics, genres, subjects, …
Determine authorship
Gender classification
Sentiment analysis
Identifying languages
…
![Page 7: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/7.jpg)
Personal project: Nieuws zonder politiek ("News without politics")
![Page 8: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/8.jpg)
Fun project from 2010
Related to the 589 days Belgium went without an elected government. There were a lot of politics-related non-news items that I wanted to filter out as an experiment.
News aggregator that fetched news from different Flemish newspapers
Classified those items into political and non-political news
![Page 9: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/9.jpg)
Personal project: Wuk zeg je? (West-Flemish for "What are you saying?")
![Page 10: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/10.jpg)
Fun project released at the end of 2015
Inspired by a contest held by the province of West Flanders to find foreign words that sound West-Flemish
Can recognise the West-Flemish dialect, but also Dutch, French and English
Uses character n-grams instead of words
![Page 11: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/11.jpg)
NLP terminology
![Page 12: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/12.jpg)
Tokenization
Before any real text processing can be done, we need to tokenize the text.
Tokenization is the task of dividing text into words, sentences, symbols or other elements called tokens.
In machine learning, tokens are often referred to as features.
![Page 13: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/13.jpg)
N-grams
N-grams are sequences of N tokens
The tokens can be words, combinations of words, characters, …
Depending on its size, an n-gram is sometimes called a unigram (1 item), a bigram (2 items) or a trigram (3 items).
Character n-grams are well suited for language classification
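As an illustration of character n-grams, here is a minimal sketch in plain PHP. The function name `charNgrams` is hypothetical and not part of any library used in this talk; it only shows how a sliding window over the characters of a word produces the n-grams.

```php
<?php
// Hypothetical helper: return all character n-grams of length $n in $text.
function charNgrams(string $text, int $n): array
{
    if ($n < 1) {
        return [];
    }
    // Split into UTF-8 characters so multibyte letters stay intact.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // input was not valid UTF-8
    }
    $ngrams = [];
    // Slide a window of $n characters over the text.
    for ($i = 0; $i + $n <= count($chars); $i++) {
        $ngrams[] = implode('', array_slice($chars, $i, $n));
    }
    return $ngrams;
}

print_r(charNgrams('wuk', 2)); // the bigrams "wu" and "uk"
```

A language classifier could then count these n-grams per language instead of counting whole words.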
![Page 14: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/14.jpg)
Stop words
Stop words are words (or features) that are particularly common in a text corpus
For example: the, and, on, in, …
They are considered uninformative
A list of stop words is used to remove or ignore words from the document we are processing
Removing them is optional but recommended
![Page 15: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/15.jpg)
Stemming
Stemming is the process of reducing words to their word stem, base or root.
It is not a required step, but it can help by reducing the number of features and improving the classification task (e.g. speed or quality)
The most widely used is the Porter stemmer. The original algorithm targets English; the related Snowball stemmers support French, Dutch, …
![Page 16: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/16.jpg)
Bag Of Words (BOW) model
A simple representation of text features
Features can be words, combinations of words, sounds, …
A BOW model contains a vocabulary together with a count for each entry in it
![Page 17: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/17.jpg)
Training / test set
A training set is a collection of labeled data used to train the classifier:
Free gift card -> spam
The server is down -> ham
A test set is labeled data held back to measure the accuracy of our classifier
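Splitting labeled data into a training set and a test set could be sketched as follows. The helper name `trainTestSplit`, the default ratio of 0.8 and the placeholder documents are assumptions made for illustration only.

```php
<?php
// Hypothetical helper: shuffle labeled examples and split them in two parts.
function trainTestSplit(array $examples, float $trainRatio = 0.8): array
{
    // Shuffle a copy so the split does not depend on the original order.
    shuffle($examples);
    $cut = (int) round(count($examples) * $trainRatio);
    return [array_slice($examples, 0, $cut), array_slice($examples, $cut)];
}

// Placeholder data: each example is [label, text].
$examples = [];
for ($i = 1; $i <= 10; $i++) {
    $examples[] = [$i % 2 === 0 ? 'spam' : 'ham', "placeholder document $i"];
}

[$train, $test] = trainTestSplit($examples, 0.8);
printf("%d training documents, %d test documents\n", count($train), count($test));
```

Only the training part is shown to the classifier; the test part is kept aside to check its predictions.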
![Page 18: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/18.jpg)
A typical flow
PHP is a server-side scripting language designed for web development
![Page 19: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/19.jpg)
A typical flow
PHP | is | a | server-side | scripting | language | designed | for | web | development
![Page 20: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/20.jpg)
![Page 21: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/21.jpg)
A typical flow
PHP | server-side | scripting | language | designed | web | development
![Page 22: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/22.jpg)
A typical flow
PHP : 1
server-side : 1
scripting : 1
language : 1
designed : 1
web : 1
development : 1
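The flow above (tokenize, remove stop words, count) can be sketched in a few lines of plain PHP. The function name `bagOfWords` is hypothetical, and the stop word list is an assumption based on the words that disappear between the slides ("is", "a", "for").

```php
<?php
// Hypothetical helper: turn a sentence into a bag-of-words count table.
function bagOfWords(string $text, array $stopWords): array
{
    // 1. Tokenize on whitespace.
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    // 2. Remove stop words (compared in lowercase).
    $tokens = array_filter(
        $tokens,
        fn($token) => !in_array(strtolower($token), $stopWords, true)
    );
    // 3. Count how often each remaining token occurs.
    return array_count_values($tokens);
}

$text = 'PHP is a server-side scripting language designed for web development';
print_r(bagOfWords($text, ['is', 'a', 'for']));
```

A real pipeline would usually also lowercase the tokens and apply stemming before counting.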
![Page 23: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/23.jpg)
Bayes theorem
![Page 24: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/24.jpg)
Some history trivia
Discovered by the British minister Thomas Bayes in the 1740s.
Rediscovered independently by the French scholar Pierre-Simon Laplace, who gave it its modern mathematical form.
Alan Turing used it to help decode the German Enigma cipher, which had a big influence on the outcome of World War 2.
![Page 25: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/25.jpg)
Bayes theorem
In probability theory and statistics, Bayes theorem describes the probability of an event based on conditions that might be related to that event.
E.g. how probable it is that an article is about sports, based on certain words that the article contains.
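For reference, the theorem is usually written as follows, with A and B as generic event names (these symbols are illustrative and are not taken from the slides):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In the sports example, A would be "the article is about sports" and B would be "the article contains certain words".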
![Page 26: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/26.jpg)
Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes theorem
The naive part is that they assume strong independence between features (words in our case)
![Page 27: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/27.jpg)
Bayes and text classification
We can write the standard Bayes formula as:
$$P(C \mid D) = \frac{P(D \mid C)\,P(C)}{P(D)}$$
where C is the class and D is the document.
We can drop P(D) because it is the same for every class, so it does not change which class scores highest. This is very common when using Naive Bayes for classification problems.
![Page 28: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/28.jpg)
Probability of a class
$$P(C) = \frac{D_c}{D_t}$$
where $D_c$ is the number of documents in our training set that have this class
and $D_t$ is the total number of documents in our training set
![Page 29: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/29.jpg)
Probability of a class given a document
$$P(C \mid D) \propto P(C)\,P(w_1, w_2, w_3, \dots \mid C)$$
where $w_x$ are the words of our text
The second factor is the (joint) probability of word 1, word 2, word 3, … given our class
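A sketch of how the naive independence assumption is typically applied to this joint probability, using $n$ for the number of words and $\hat{C}$ for the predicted class (both symbols are assumptions added for illustration):

$$P(w_1, w_2, \dots, w_n \mid C) \approx \prod_{i=1}^{n} P(w_i \mid C)$$

$$\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(w_i \mid C)$$

Each $P(w_i \mid C)$ can be estimated from the word counts of class $C$ in the training set.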
![Page 30: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/30.jpg)
Enough abstract formulas for today,
2 simplified examples
![Page 31: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/31.jpg)
We have the following data*

| word | good | bad | total |
|---|---|---|---|
| server | 5 | 6 | 11 |
| crashed | 2 | 14 | 16 |
| updated | 9 | 1 | 10 |
| new | 8 | 1 | 9 |
| total | 24 | 22 | 46 |

* In reality your data will contain a lot more words and higher counts
![Page 32: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/32.jpg)
| word | good | bad | total |
|---|---|---|---|
| server | 5 | 6 | 11 |
| crashed | 2 | 14 | 16 |
| … | … | … | … |
| total | 24 | 22 | 46 |

The server has crashed
(We applied a stop word filter that removes the words “the” and “has”)
![Page 33: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/33.jpg)
| word | good | bad | total |
|---|---|---|---|
| server | 5 | 6 | 11 |
| updated | 9 | 1 | 10 |
| new | 8 | 1 | 9 |
| … | … | … | … |
| total | 24 | 22 | 46 |

The new server is updated
(We applied a stop word filter that removes the words “the” and “is”)
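The two examples can be worked through with a few lines of plain PHP. This is a sketch under stated assumptions, not necessarily the calculation shown on the original slides: the helper `scoreClasses` is hypothetical, P(word | class) is estimated as the word count divided by the class total, and the class prior is estimated from the word totals (24/46 and 22/46) because the data contains no document counts. There is no smoothing, so every word must appear in the table.

```php
<?php
// Word counts per class, copied from the table above.
$counts = [
    'server'  => ['good' => 5, 'bad' => 6],
    'crashed' => ['good' => 2, 'bad' => 14],
    'updated' => ['good' => 9, 'bad' => 1],
    'new'     => ['good' => 8, 'bad' => 1],
];
$totals = ['good' => 24, 'bad' => 22];

// Hypothetical helper: score each class as P(C) * product of P(word | C).
function scoreClasses(array $words, array $counts, array $totals): array
{
    $grandTotal = array_sum($totals);
    $scores = [];
    foreach ($totals as $class => $classTotal) {
        // Assumption: the prior is estimated from the word totals.
        $score = $classTotal / $grandTotal;
        foreach ($words as $word) {
            $score *= $counts[$word][$class] / $classTotal;
        }
        $scores[$class] = $score;
    }
    return $scores;
}

// The two sentences after stop word removal.
foreach ([['server', 'crashed'], ['new', 'server', 'updated']] as $words) {
    $scores = scoreClasses($words, $counts, $totals);
    arsort($scores); // highest score first
    printf("%s => %s\n", implode(' ', $words), array_key_first($scores));
}
// prints:
// server crashed => bad
// new server updated => good
```

With larger vocabularies the products become very small, which is why implementations often sum logarithms instead of multiplying probabilities.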
![Page 34: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/34.jpg)
NLP in PHP
![Page 35: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/35.jpg)
NlpTools
NlpTools is a library for natural language processing written in PHP
It provides classes for classifying, tokenizing, stemming, clustering, topic modeling, …
Released under the WTFPL license (do what you want)
![Page 36: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/36.jpg)
Tokenizing a sentence

```php
use NlpTools\Tokenizers\WhitespaceTokenizer;

// text we will be converting into tokens
$text = "PHP is a server side scripting language.";

// initialize whitespace tokenizer
$tokenizer = new WhitespaceTokenizer();

// print array of tokens
print_r($tokenizer->tokenize($text));
```
![Page 37: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/37.jpg)
Dealing with stop words

```php
use NlpTools\Documents\TokensDocument;
use NlpTools\Tokenizers\WhitespaceTokenizer;
use NlpTools\Utils\StopWords;

// text we will be converting into tokens
$text = "PHP is a server side scripting language.";

// define a list of stop words
$stop = new StopWords(array("is", "a", "as"));

// initialize whitespace tokenizer
$tokenizer = new WhitespaceTokenizer();

// init token document
$doc = new TokensDocument($tokenizer->tokenize($text));

// apply our stop words
$doc->applyTransformation($stop);

// print filtered tokens
print_r($doc->getDocumentData());
```
![Page 38: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/38.jpg)
Dealing with stop words
![Page 39: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/39.jpg)
Stemming words

```php
use NlpTools\Stemmers\PorterStemmer;

// init PorterStemmer
$stemmer = new PorterStemmer();

// stemming variants of upload
printf("%s\n", $stemmer->stem("uploading"));
printf("%s\n", $stemmer->stem("uploaded"));
printf("%s\n", $stemmer->stem("uploads"));

// stemming variants of delete
printf("%s\n", $stemmer->stem("delete"));
printf("%s\n", $stemmer->stem("deleted"));
printf("%s\n", $stemmer->stem("deleting"));
```
![Page 40: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/40.jpg)
Stemming words
![Page 41: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/41.jpg)
Classification (training 1/2)

```php
use NlpTools\Documents\TrainingSet;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Tokenizers\WhitespaceTokenizer;

// labeled training data: array(class, document)
$training = array(
    array('us', 'new york is a hell of a town'),
    array('us', 'the statue of liberty'),
    array('us', 'new york is in the united states'),
    array('uk', 'london is in the uk'),
    array('uk', 'the big ben is in london'),
    // …
);

// hold our training documents
$trainingSet = new TrainingSet();

// our tokenizer
$tokenizer = new WhitespaceTokenizer();

// will hold the features we will be working with
$features = new DataAsFeatures();
```
![Page 42: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/42.jpg)
Classification (training 2/2)

```php
use NlpTools\Documents\TokensDocument;
use NlpTools\Models\FeatureBasedNB;

// iterate over training array
foreach ($training as $trainingDocument) {
    // add to our training set
    $trainingSet->addDocument(
        // class
        $trainingDocument[0],
        // document
        new TokensDocument($tokenizer->tokenize($trainingDocument[1]))
    );
}

// train our Naive Bayes model
$bayesModel = new FeatureBasedNB();
$bayesModel->train($features, $trainingSet);
```
![Page 43: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/43.jpg)
Classification (classifying)

```php
use NlpTools\Classifiers\MultinomialNBClassifier;

// labeled test data: array(class, document)
$testSet = array(
    array('us', 'i want to see the statue of liberty'),
    array('uk', 'i saw the big ben yesterday'),
    // …
);

// init our Naive Bayes classifier using the features and our model
$classifier = new MultinomialNBClassifier($features, $bayesModel);

// iterate over our test set
foreach ($testSet as $testDocument) {
    // predict our sentence
    $prediction = $classifier->classify(
        array('us', 'uk'), // the classes that can be predicted
        new TokensDocument($tokenizer->tokenize($testDocument[1])) // the sentence
    );

    printf(
        "sentence: %s | class: %s | predicted: %s\n",
        $testDocument[1],
        $testDocument[0],
        $prediction
    );
}
```
![Page 44: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/44.jpg)
Classification
![Page 45: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/45.jpg)
Some tips
It is best practice to split your data into a training set and a test set instead of training on your whole dataset!
If you train your classifier on the whole dataset, it may be very accurate on that dataset but perform badly on unseen data. In machine learning this is called overfitting.
There isn't a single best split, but 80-20 (Pareto principle) or 70-30 are safe ratios.
The numbers tell the tale! There are multiple ways of measuring how well your classifier performs, but precision and recall are a good start: http://www.kdnuggets.com/faq/precision-recall.html
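Precision and recall for one class can be computed from the true and predicted labels of a test set. The sketch below uses a hypothetical helper `precisionRecall` and made-up labels; it is not part of NlpTools.

```php
<?php
// Hypothetical helper: precision and recall for one class of interest.
function precisionRecall(array $actual, array $predicted, string $positive): array
{
    $tp = $fp = $fn = 0;
    foreach ($actual as $i => $label) {
        $guess = $predicted[$i];
        if ($guess === $positive && $label === $positive) {
            $tp++; // correctly predicted the class
        } elseif ($guess === $positive) {
            $fp++; // predicted the class, but it was something else
        } elseif ($label === $positive) {
            $fn++; // missed a document of the class
        }
    }
    return [
        'precision' => ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0.0,
        'recall'    => ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0.0,
    ];
}

// Made-up labels for illustration only.
$actual    = ['us', 'us', 'uk', 'uk', 'us'];
$predicted = ['us', 'uk', 'uk', 'us', 'us'];
print_r(precisionRecall($actual, $predicted, 'us'));
```

Precision answers "of everything predicted as this class, how much was right?", while recall answers "of everything that really is this class, how much did we find?".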
![Page 46: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/46.jpg)
Some online PHP resources
http://www.php-nlp-tools.com/ - The homepage of NlpTools
http://www.phpir.com - Contains a lot of tutorials regarding information retrieval in PHP
https://github.com/camspiers/statistical-classifier - An alternative Bayes classifier that also supports SVM
![Page 47: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/47.jpg)
Reading material
The code examples are written in Java and Python, but the concepts can easily be applied in other languages…
![Page 48: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/48.jpg)
PHP NLP projects released as open source
php-dutch-stemmer: a PHP class that stems Dutch words, based on Porter's algorithm. https://github.com/simplicitylab/php-dutch-stemmer
php-luhn-summarize: a class that provides a basic implementation of Luhn's algorithm, which can automatically create a summary of a given text. https://github.com/simplicitylab/php-luhn-summarize
![Page 49: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/49.jpg)
http://www.slideshare.net/GlennDeBacker
https://github.com/simplicitylab/Talks
https://joind.in/talk/0d9b0
![Page 50: Text classification-php-v4](https://reader036.fdocuments.net/reader036/viewer/2022081507/5885e63e1a28ab906d8b72d7/html5/thumbnails/50.jpg)
Thank you !