Stringi Package in R
While base R and stringr functions are adequate for simple text processing, more complex problems such as natural language processing call for a more capable package.
stringi fills that role: it provides replacements for nearly all the character string processing functions known from base R, with an emphasis on high performance and portability. Its many features include text sorting, text comparison, extraction of words, sentences, and characters, text transliteration, and string replacement.
The following commands from the stringi package can be used for text processing:
#Install and load stringi
>install.packages("stringi")
>library(stringi)
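As a quick taste of the features listed in the introduction, the sketch below touches sorting, comparison, and transliteration. The sample strings are made up for illustration, and the default locale is assumed for sorting and comparison.

```r
library(stringi)

# Locale-aware sorting of a character vector
sorted <- stri_sort(c("banana", "apple", "cherry"))

# Comparison returns -1, 0, or 1 depending on how the two strings order
cmp <- stri_cmp("apple", "banana")

# Transliteration with ICU's Latin-ASCII transform strips the accent
plain <- stri_trans_general("caf\u00e9", "Latin-ASCII")

print(sorted)
print(cmp)
print(plain)
```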
#Consider this object "test", which is a review of the iPhone 6
>test<-"I loved my i5 but hate the i6. To be fair, the display quality is much better and the camera/photo resolution is amazing. Some will find the privacy feature on the browser to be a welcome change. However, I have average size hands and find that the buttons are all in the wrong places. I frequently put the phone into sleep mode when trying to text (due to the placement of the sleep button on the side) and unless you have super-long monkey fingers and thumbs it is very difficult to cover the span of the keyboard (when turned sideways) and impossible to work one/handed (just try reaching those icons in the upper and left portions of the phone with your thumb). I took advantage of Apple’s free trade-in offer, but I’m going in tomorrow and asking for my old i5 back. The enhanced display and camera resolution simply can’t make up for the increased difficulty and hassle to operate."
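# stri_split_boundaries
#Splits text at word, sentence, or character boundaries: input is a character vector, output is a list
#opts_brkiter is passed by name because the second positional argument is the split limit in recent stringi releases
#Extract sentences
>test1<-stri_split_boundaries(test, opts_brkiter=stri_opts_brkiter(type="sentence"))
>test1
#Extract characters
>stri_split_boundaries(test, opts_brkiter=stri_opts_brkiter(type="character"))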
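The listing above shows sentence and character splits but no word-level split. A minimal sketch of one follows, on a short made-up string. The `skip_word_none = TRUE` option is used here on the assumption that whitespace and punctuation tokens are not wanted in the result.

```r
library(stringi)

demo <- "The display is great. The buttons are not."

# Split at word boundaries; skip_word_none drops spaces and punctuation
word_list <- stri_split_boundaries(
  demo,
  opts_brkiter = stri_opts_brkiter(type = "word", skip_word_none = TRUE)
)

# The result is a list with one character vector per input string
print(word_list[[1]])
```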
##The following code converts test1 (which is a list) to a character vector
#Since there are 7 sentences, create a placeholder vector "words" with 7 elements
>words<-c("H","H","H","H","H","H","H")
#The following loop populates "words" with the 7 sentences
#This is needed because functions such as stri_extract_words and stri_replace_all_fixed expect a character vector, not a list
>for(i in 1:7)
{ words[i]<-test1[[1]][i];print(i) }
#Equivalent shortcut: words<-test1[[1]]
>words
>class(words)
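The conversion above hard-codes the number of sentences. A sketch that avoids counting them by hand is shown below, using a short made-up string. The variable names are illustrative only.

```r
library(stringi)

demo <- "First sentence here. Second one follows. And a third."

sentence_list <- stri_split_boundaries(
  demo,
  opts_brkiter = stri_opts_brkiter(type = "sentence")
)

# The first list element is already a character vector, one entry per sentence
sentence_vec <- sentence_list[[1]]

# Remove leading and trailing whitespace from each sentence
sentence_vec <- stri_trim_both(sentence_vec)

print(class(sentence_vec))
print(sentence_vec)
```

For several input strings at once, `unlist(sentence_list)` would flatten all of the sentences into a single vector.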
#Extracts words in a string
>stri_extract_words(test)
>stri_extract_words(words) ##Gives the wordlist for each sentence
#Counts words in a string
>stri_count_words(test)
## [1] 164
>stri_count_words(words) ##Gives the word count for each sentence
## [1] 8 16 14 17 70 21 18
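To see how the two functions relate, the sketch below checks on two made-up sentences that the length of each extracted word list matches the corresponding word count.

```r
library(stringi)

x <- c("To be fair, the display is better.", "I want my old i5 back.")

# stri_extract_words returns a list: one character vector of words per string
word_list <- stri_extract_words(x)

# lengths() gives the size of each list element
per_string <- lengths(word_list)

# stri_count_words returns the counts directly as an integer vector
counts <- stri_count_words(x)

print(per_string)
print(counts)
```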
#Determine whether a string starts or ends with a given pattern.
>stri_startswith_fixed(words, "I")
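The counterpart for suffixes is `stri_endswith_fixed`. The sketch below uses made-up strings. Sentences produced by a boundary split may carry trailing whitespace, so trimming them first (for example with `stri_trim_both`) is probably wise before testing a suffix.

```r
library(stringi)

reviews <- c("I loved my i5.", "However, the buttons are misplaced.", "Is it worth it?")

# TRUE where the string begins with "I"
starts_with_i <- stri_startswith_fixed(reviews, "I")

# TRUE where the string ends with a period
ends_with_dot <- stri_endswith_fixed(reviews, ".")

print(starts_with_i)
print(ends_with_dot)
```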
#stri_replace_all_*: Replaces every occurrence of a pattern with a replacement string
#stri_replace_all_* gained a vectorize_all parameter, which defaults to TRUE for backward compatibility.
#With vectorize_all=FALSE, each pattern/replacement pair is applied in turn to every string
#In this example, "amazing" and "welcome" are replaced with "excellent" and "good"
>stri_replace_all_fixed(words,c("amazing","welcome"), c("excellent","good"), vectorize_all=FALSE)
## stri_replace_all_fixed
#Here we compare vectorize_all=TRUE with vectorize_all=FALSE
#With vectorize_all=TRUE, each pattern is applied separately, so this call returns two strings
>stri_replace_all_fixed("The white color iphone 6S is more appealing to the customers than iphone 5s",c("appeal", "white"), c("interest", "red"), vectorize_all=TRUE)
#With vectorize_all=FALSE, both replacements are applied to the same string
>stri_replace_all_fixed("The white color iphone 6S appeals more to the customers than iphone 5s",c("appeal", "white"), c("interest", "red"),vectorize_all=FALSE)
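The difference is easiest to see when the input vector has the same length as the pattern vector. In the sketch below (made-up strings), `vectorize_all = TRUE` pairs the first string with the first pattern and the second string with the second pattern, while `vectorize_all = FALSE` runs every pattern over every string.

```r
library(stringi)

y <- c("amazing and welcome", "welcome")
patterns <- c("amazing", "welcome")
replacements <- c("excellent", "good")

# Element-wise: y[1] is paired with patterns[1], y[2] with patterns[2]
elementwise <- stri_replace_all_fixed(y, patterns, replacements, vectorize_all = TRUE)

# Sequential: every pattern is applied to every string
sequential <- stri_replace_all_fixed(y, patterns, replacements, vectorize_all = FALSE)

print(elementwise)
print(sequential)
```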
## stri_replace_all_regex
#Compare the results:
#Here we compare fixed matching with regex matching anchored at word boundaries (\\b)
#The fixed pattern "appeal" also matches inside "appealing"
>stri_replace_all_fixed("The white color iphone 6S is more appealing to the customers than iphone 5s",c("appeal", "white"), c("interest", "red"), vectorize_all=FALSE)
#The regex \\bappeal\\b matches only the whole word, so "appeals" is left unchanged
>stri_replace_all_regex("The white color iphone 6S appeals more to the customers than iphone 5s","\\b"%s+%c("appeal", "white")%s+%"\\b", c("interest", "red"),vectorize_all=FALSE)
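The `%s+%` operator used above is stringi's vectorized string concatenation. The sketch below builds the same word-boundary patterns step by step and checks them with `stri_detect_regex` on made-up inputs.

```r
library(stringi)

# Wrap each word in \b anchors; %s+% concatenates element-wise
patterns <- "\\b" %s+% c("appeal", "white") %s+% "\\b"
print(patterns)

# "appeals" has no word boundary right after "appeal", so the match fails
partial_word <- stri_detect_regex("appeals", patterns[1])

# A standalone "appeal" does match
whole_word <- stri_detect_regex("an appeal to customers", patterns[1])

print(partial_word)
print(whole_word)
```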
##The following command keeps only the valid email IDs
>stri_subset_regex(c("john@office.company.com", "steve1932@g00gl3.eu", "No email here","abi20hotmail.com"),"^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\\.)+[A-Za-z]{2,4}$")
#For a complete regex reference, see: http://docs.rexamine.com/R-man/stringi/stringi-search-regex.html
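When the goal is to flag entries rather than drop them, `stri_detect_regex` returns a logical vector that can be used for indexing. The sketch below reuses the same pattern and inputs. Note that the `{2,4}` quantifier rejects top-level domains longer than four letters, which may or may not be desirable.

```r
library(stringi)

emails <- c("john@office.company.com", "steve1932@g00gl3.eu",
            "No email here", "abi20hotmail.com")
pattern <- "^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\\.)+[A-Za-z]{2,4}$"

# One TRUE/FALSE per input string
is_valid <- stri_detect_regex(emails, pattern)

# Logical indexing gives the same result as stri_subset_regex
valid_emails <- emails[is_valid]

print(is_valid)
print(valid_emails)
```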
#stri_split_fixed
#Use this to split strings on ";", "_", or any other fixed delimiter
#The limit argument is named n in current stringi releases (older releases called it n_max)
>stri_split_fixed(c("iphone5s->bad", "iphone6s->good", "phone", ""), "->", n=1, tokens_only=TRUE, omit_empty=TRUE)
>stri_split_fixed(c("iphone5s->bad", "iphone6s->good", "phone", ""), "->", n=2, tokens_only=TRUE, omit_empty=TRUE)
#stri_list2matrix
#Converts a list of atomic vectors to a character matrix
>stri_list2matrix(stri_split_fixed(c("iphone5s->bad", "iphone6s->good", "phone", ""), "->", n=2, tokens_only=TRUE, omit_empty=TRUE))
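By default `stri_list2matrix` places each list element in its own column. The sketch below, on made-up input, uses `byrow = TRUE` to get one row per input string instead, which is often the more convenient shape for building a data frame. Shorter elements are padded with `NA`.

```r
library(stringi)

parts <- stri_split_fixed(c("iphone5s->bad", "iphone6s->good", "phone"), "->")

# Default layout: one column per list element
by_column <- stri_list2matrix(parts)

# byrow = TRUE: one row per list element, padded with NA where needed
by_row <- stri_list2matrix(parts, byrow = TRUE)

print(by_column)
print(by_row)
```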