Use Python to Clean Your Text Stream


Ok, Potty Mouth. Your Time is Up! .. Maybe Not?

In this post, I’m going to show you a decent Python function (lib) you can use to clean your text stream. Thoughts: we do a lot of sentiment analytics and use several methods, libs and use cases. There are times when you’ll be asked to remove the “dirty” words from a stream. I say go ahead and remove them from the visually displayed app, if there is one, BUT.. there are several corpora that use the “bad words” to determine sentiment or context of speech. SO, leave them in the sentences that are not seen and, where it makes sense, train your classifiers with replacement words -OR- store each changed sentence so you can reconstitute the true meaning or intent of the author. “Holy Battle Ship, Batman!”

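Follow-up: thank you to a poster on Google+, who pointed out that this is a great example of the Scunthorpe problem, where a filter that matches substrings ends up blocking innocent words that happen to contain a bad one.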
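To make the Scunthorpe problem concrete, here is a minimal sketch that compares naive substring replacement with whole-word matching. The word list and both function names are made up for illustration; this is not how the lib is implemented, just a way to see the failure mode.

```python
import re

# Hypothetical word list for illustration; not the lib's bad_words.txt.
BAD_WORDS = ["hell", "crap"]

def naive_clean(text, clean_word="unicorn"):
    """Replace every occurrence of a bad word, even inside longer words."""
    for word in BAD_WORDS:
        text = re.sub(re.escape(word), clean_word, text, flags=re.IGNORECASE)
    return text

def whole_word_clean(text, clean_word="unicorn"):
    """Replace bad words only when they stand alone as whole words."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in BAD_WORDS) + r")\b"
    return re.sub(pattern, clean_word, text, flags=re.IGNORECASE)

sample = "Hello, the shell script is crap."
print(naive_clean(sample))       # mangles "Hello" and "shell" too
print(whole_word_clean(sample))  # only the standalone word is replaced
```

Word boundaries fix this particular case, but they will not catch deliberate misspellings, so treat it as a starting point.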

First, The Git.. Check out “This Lib” by Jared Mess.  Great Profile Pic Jared!  It’s fairly Simple to use…

From README.md…

from profanity_filter import Filter  # module name as used in the live examples below

f = Filter('badword and bad words', clean_word='unicorn')
safe_string = f.clean()
print(safe_string)

If you want to giggle (NSFW), check out the bad_words.txt file in the Git. If you work in multiple languages you will probably be adding to this list. It’s always fun when it’s your turn to update the bad word file and it’s your SBI in the stand-up.

Here are some “Live” examples…

kyanyoga$ python
>>> import profanity_filter
>>> product_review = "Your Product is Crap!"
>>> profane = profanity_filter.Filter(product_review, "unicorn")
>>> print ("Clean Text: %s" % profane.clean())
Clean Text: Your Product is unicorn!
>>> # At this point I added freaking and shiznit to my bad word list.
>>> movie_review = "The #coolmoviehashtag was Freaking the shiznit!"
>>> profane = profanity_filter.Filter(movie_review, "pickle")
>>> print ("Clean Text: %s" % profane.clean())
Clean Text: The #coolmoviehashtag was pickle the pickle!
>>> # Note: The following sentence would have generally positive sentiment.
>>> bar_review = "I love this freaking place! Especially the turd doorman. He made me laugh!"
>>> profane = profanity_filter.Filter(bar_review, "dolphin")
>>> print ("Clean Text: %s" % profane.clean())
Clean Text: I love this dolphin place! Especially the dolphin doorman. He made me laugh!

Be Careful, Don’t Lose Meaning!

I used a live example to show you how meaning could be lost. In this case, “unicorn” could mean a magical, mystical, amazing thing… but I’ll bet this product review is generally a negative sentence, with “higher level meaning” representing customer frustration. This is a clear-cut example. When reviews are longer (more text) and the word choice is “peppered” with colorful speech, finding sentiment and context becomes much harder. Again, you may want to save the sentence before and after, run your data mining on the before sentence, and display the after sentence in the visualization.
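One way to keep the before and after together is sketched below. It does not use the lib; the word list, the function name `clean_with_record` and the dictionary keys are all assumptions made for this example. The idea is simply to return the original sentence, the display-safe sentence and the list of words that were swapped out, so nothing is lost.

```python
import re

# Hypothetical word list for illustration only.
BAD_WORDS = {"crap", "freaking"}

def clean_with_record(text, clean_word="unicorn"):
    """Return the original, the display-safe version, and the replaced words."""
    replaced = []

    def swap(match):
        word = match.group(0)
        if word.lower() in BAD_WORDS:
            replaced.append(word)  # remember what the author actually wrote
            return clean_word
        return word

    display = re.sub(r"[A-Za-z']+", swap, text)
    return {"original": text, "display": display, "replaced": replaced}

record = clean_with_record("Your Product is Crap!")
print(record["display"])   # show this one in the app
print(record["replaced"])  # mine this one (and the original) for sentiment
```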

Use-Case Example

The following example is a Twitter stream sentiment analysis of movie reviews.

import re
from textblob import TextBlob

for item in db.tsentstrm_superman.find():
    new_tweet = item["tweet"].lower()                                   # lower
    new_tweet = re.sub(r"(?:\@|https?\://)\S+", "", new_tweet)          # remove @mentions and urls
    
    # sentiment test
    sent_tweet = TextBlob(new_tweet)
...
    # run the profanity filter - Note: We wrapped Filter into Cleaner.
    profane = Cleaner(sent_tweet.words, 'unicorn')  
    
    if profane.clean():
        # print the tweets
        #print('Date: ' + item["dt"])
        #print('Topic: ' + item["topic"])
        print('OldTweet: ' + item["tweet"])
        print('NewTweet: ' + new_tweet)
...
 

Note: This is a code excerpt, not the full example. The point is just to show that the sentiment was run after the Twitter text was lowered (converted to lowercase) and the URLs were removed. Then the cleaner ran on the words that TextBlob pulled from the sentence [more on sentiment and NoSQL with MongoDB in later posts]. We usually store the sentence at multiple stages: original (with all the bad words, hashtags, URLs, etc.), cleaned1st (special characters and URLs removed) and, most of the time, a clean-visual [this is the version of the sentence that you can show your mom].
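The multi-stage storage described above could look roughly like the sketch below. The stage names follow the ones in the note; the function `build_stages`, the one-word list and the sample tweet are hypothetical, and a simple regex stands in for the wrapped Cleaner, so read it as an outline and not as our production code.

```python
import re

# Hypothetical word list for illustration only.
BAD_WORDS = {"freaking"}

def build_stages(tweet, clean_word="unicorn"):
    """Build one record holding the tweet at each stage of cleaning."""
    # Stage 1: lowercase, then strip @mentions and URLs (same regex idea as the excerpt).
    cleaned1st = re.sub(r"(?:@|https?://)\S+", "", tweet.lower())
    cleaned1st = " ".join(cleaned1st.split())  # collapse leftover whitespace

    # Stage 2: replace whole bad words for the display version.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in sorted(BAD_WORDS)) + r")\b"
    clean_visual = re.sub(pattern, clean_word, cleaned1st)

    return {
        "original": tweet,             # untouched, for mining and audits
        "cleaned1st": cleaned1st,      # input for the sentiment step
        "clean_visual": clean_visual,  # safe to show in the app
    }

stages = build_stages("@bob Superman was FREAKING great! http://t.co/abc")
for name, text in stages.items():
    print(name + ": " + text)
```

A record shaped like this can be stored as a single document, so the original is always there when you need to re-run the analysis.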

Parting Thoughts…

At higher levels of abstraction, curse words are a significant part of speech and culture. To remove them with a cavalier attitude would be wrong. People can mean all manner of “things” with a few curse words. Some good things and some .. bad. At the same time, we don’t have to subject ourselves to that type of speech if we don’t want to. There are ways to preserve meaning and still be professional in the way we aggregate and visualize. If you’re curious, please email me and I’ll be happy to expand on this topic a bit more. Also, I will do a more in-depth series on sentiment analytics using Python, RapidMiner and others in future posts.

If you have questions about your IOT, EDW, AWS or analytics project, please email us. We will respond ASAP.

Thank you, – Gus Segura
