Text Cleaning

Before tokenizing the posts and feeding them into text classification algorithms, we need to tidy up the original posts. Text cleaning is task-oriented; for this project we have two main tasks: 1. prepare posts for human readability on the crowdsourced labeling webpage; 2. prepare posts for machine learning algorithms. We use Python to clean the posts.

An Original Post

We will use regular expressions (the re module) to clean and tidy up the text body of our posts. We can see from this example post that it contains URLs starting with ‘http’, different kinds of line breaks such as ‘\n\n’ and ‘\n*’, and emojis. There are also unwanted words and phrases we want to eliminate.

[Image: the original post]
General Text Cleaning

Our next step is a general cleaning of each post: we write a series of functions to remove unwanted parts of the post and format it for both machine and human readability. We will replace irregular line endings with ‘\n’, remove unwanted XML tags, strip common media content such as URLs and user mentions, and replace special characters with whitespace.
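Below is a minimal sketch of what such helpers could look like with the re module; the function names and exact patterns are illustrative assumptions, not the original code.

```python
import re

def normalize_line_endings(text):
    # Collapse '\r\n', '\n*' and runs of newlines into a single '\n'
    return re.sub(r'(?:\r\n|\n\*|\n)+', '\n', text)

def remove_xml_tags(text):
    # Drop anything that looks like an XML/HTML tag, e.g. <br/>, <p>
    return re.sub(r'<[^>]+>', ' ', text)

def remove_urls_and_mentions(text):
    text = re.sub(r'http\S+', ' ', text)  # URLs starting with 'http'
    return re.sub(r'@\w+', ' ', text)     # user mentions like '@someone'

def replace_special_chars(text):
    # Keep letters, digits, basic punctuation and newlines;
    # everything else (emojis, symbols) becomes whitespace
    return re.sub(r"[^A-Za-z0-9 .,!?'\n]", ' ', text)
```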

Finally, we combine all of the above functions into one ‘general_clean’ function and condense each post into a single paragraph.
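One possible way to chain the helpers sketched above into a single ‘general_clean’ step is shown below; the order of the calls and the rule for joining everything into one paragraph are assumptions.

```python
import re

def general_clean(text):
    # Relies on the helper functions sketched above
    text = normalize_line_endings(text)
    text = remove_xml_tags(text)
    text = remove_urls_and_mentions(text)
    text = replace_special_chars(text)
    # Condense the post into one paragraph: collapse all whitespace runs
    return re.sub(r'\s+', ' ', text).strip()
```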

Here is how the post looks after general cleaning.

[Image: the post after general cleaning]
Text Cleaning for Human Readability

After general cleaning, we can customize the format and content of our posts to give viewers of the future crowdsourced labeling workflow a better reading experience. In this step we remove meaningless words and phrases. For example, we don’t want ‘[Original post Day 1 Day 2 …]’ in the post, since it adds little to the main topic.
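A minimal sketch of this readability pass, assuming a hand-maintained list of boilerplate patterns (the list and pattern below are illustrative):

```python
import re

# Illustrative list; in practice it would grow as new boilerplate is spotted
BOILERPLATE_PATTERNS = [
    r'\[Original post[^\]]*\]',  # e.g. '[Original post Day 1 Day 2 ...]'
]

def clean_for_human(text):
    for pattern in BOILERPLATE_PATTERNS:
        text = re.sub(pattern, ' ', text, flags=re.IGNORECASE)
    # Tidy up any double spaces left behind
    return re.sub(r' {2,}', ' ', text).strip()
```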

Let’s see the result of this further cleaning for human readability.

[Image: the post cleaned for human readability]
Text Cleaning for Machine Learning

Tokenizing and lemmatizing are the initial steps in machine learning pipelines for text, but it helps improve the quality of the resulting token dictionaries if we manually remove punctuation and expand English contractions first.
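A sketch of this manual step, assuming a small hand-written contraction map (a real map would be much longer):

```python
import re
import string

# Order matters: specific forms like "can't" must come before the
# generic "n't" rule (Python dicts preserve insertion order)
CONTRACTIONS = {
    "can't": "cannot", "won't": "will not", "n't": " not",
    "'re": " are", "'ve": " have", "'ll": " will", "'m": " am",
}

def expand_contractions(text):
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def clean_for_machine(text):
    # Expand contractions first, since they contain apostrophes
    return remove_punctuation(expand_contractions(text)).lower()
```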

Most NLP libraries (spaCy, TensorFlow, NLTK) offer sophisticated preprocessing workflows, including removing punctuation, lowercasing each token, and analyzing the composition of a string.
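For instance, spaCy can tokenize, lemmatize, lowercase, and filter punctuation in a single pass (this assumes the ‘en_core_web_sm’ model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def tokenize_and_lemmatize(text):
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if not tok.is_punct and not tok.is_space]
```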

Let’s see the result of the further cleaning for machine learning.

[Image: the post cleaned for machine learning]

The code and results can be found and downloaded here:
