The rise of social media, and particularly its growing use among adolescents, has created a new means of bullying. Cyber-bullying has had serious consequences for young internet users, which calls for a response. To stop this problem effectively, we need a verified method of detecting cyber-bullying in online text; this project aims to find that method. We examine thirteen thousand labeled posts from Formspring and build a bank of the words used in those posts. First, the posts are cleaned by removing punctuation, normalizing emoticons, and removing high- and low-frequency words. Because many words in online text are misspelled, either purposefully or unintentionally, spell-checking software is run over the vocabulary so that spelling variations are accounted for. Using this word bank, we create a term-by-document matrix in which each post is its own document. With Latent Semantic Indexing (LSI), a query can then be run against the matrix to retrieve posts that may contain cyber-bullying content. The algorithm is then trained by adjusting our post-cleaning methods and revising the spelling corrections for particular recurring words. With an established approach to pruning the word bank, we test our LSI algorithm on other data sets.
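The pipeline described above (clean the posts, build a term-by-document matrix, query it in a reduced latent space) can be illustrated with a minimal sketch. It is not the project's implementation: the toy posts, the `smiley` emoticon token, the `min_df`/`max_df` frequency thresholds, and all function names are assumptions made for illustration, and the spell-checking step is omitted. To stay dependency-free, the sketch approximates the leading singular vectors by power iteration, where a real system would more likely call a library SVD routine.

```python
import math
import re
from collections import Counter

# Hypothetical toy posts; the labeled Formspring data is not reproduced here.
POSTS = [
    "you are so stupid and ugly!!",
    "nobody likes you, stupid loser",
    "what is your favorite movie?",
    "I love that movie, great soundtrack :)",
]

def tokenize(post):
    """Lowercase, map one emoticon to a word token, and drop punctuation."""
    post = post.lower().replace(":)", " smiley ")
    return re.findall(r"[a-z]+", post)

def build_matrix(posts, min_df=1, max_df=None):
    """Return (vocab, A), where A[i][j] counts term i in post j.

    Terms whose document frequency falls outside [min_df, max_df] are
    dropped, mirroring the removal of low- and high-frequency words.
    """
    docs = [Counter(tokenize(p)) for p in posts]
    df = Counter(t for d in docs for t in d)
    max_df = len(posts) if max_df is None else max_df
    vocab = sorted(t for t, n in df.items() if min_df <= n <= max_df)
    return vocab, [[d[t] for d in docs] for t in vocab]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def latent_basis(A, k, iters=200):
    """Approximate the top-k left singular vectors of A by power iteration
    on A A^T, deflating against the vectors already found."""
    m = len(A)
    gram = [[dot(A[i], A[j]) for j in range(m)] for i in range(m)]
    basis = []
    for _ in range(k):
        v = [1.0] * m
        for _step in range(iters):
            w = [dot(row, v) for row in gram]
            for u in basis:  # remove directions already captured
                c = dot(w, u)
                w = [a - c * b for a, b in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            v = [a / norm for a in w]
        basis.append(v)
    return basis

def project(vec, basis):
    """Coordinates of a term-space vector in the k-dimensional latent space."""
    return [dot(u, vec) for u in basis]

def cosine(x, y):
    denom = math.sqrt(dot(x, x) * dot(y, y))
    return dot(x, y) / denom if denom else 0.0

def lsi_scores(posts, query, k=2):
    """Cosine similarity between the query and every post in latent space."""
    vocab, A = build_matrix(posts)
    basis = latent_basis(A, k)
    q_counts = Counter(tokenize(query))
    q = project([q_counts[t] for t in vocab], basis)
    return [cosine(q, project(col, basis)) for col in zip(*A)]

scores = lsi_scores(POSTS, "loser")
for post, score in sorted(zip(POSTS, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {post}")
```

In this toy example the first post never contains the query word "loser", yet it scores close to the second post because the two share terms that load on the same latent dimension. That kind of indirect match is the motivation for querying in the reduced space rather than matching keywords directly.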
Bigelow, Jacob L., "Latent Semantic Indexing in the Discovery of Cyber-bullying in Online Text" (2016). Computer Science Summer Fellows. Paper 2.