
Raincheck for SoundCloud: Detecting spam users in SoundCloud


SoundCloud is an audio distribution platform that enables its users to upload, record, and share their originally created sounds.
Bot and fake accounts are a persistent problem on SoundCloud, where rich but undeserving musicians use them to gain more followers. As a result, genuinely talented musicians go unnoticed, whereas plagiarized or undeserving music (promoted by bot/fake accounts) becomes famous. Below is an infographic that explains the motivation behind our work. The primary objective of our project was to prevent users from following untrustworthy SoundCloud accounts and to level the playing field for artists who are actually worthy.




Therefore, to solve the above problem, we built a Chrome extension that detects bot-like profiles on SoundCloud. It relies on a machine learning model that we trained and validated to identify fake accounts, using the feature vector described below.


Methodology


We crawled SoundCloud’s website in breadth-first (BFS) order, visiting the followers of a user, then the followers of those followers, and so on, and extracted the following features by web scraping:
    • Follower and following count
    • Profile picture
    • Number of tracks posted
    • Suspicious usernames
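
The crawl order described above can be sketched as follows. This is a minimal illustration, not the crawler used in the project: the function name `crawl_bfs`, the `get_followers` callback, and the toy follower graph are all assumptions made for the example, and a real crawler would replace the callback with code that scrapes a profile's follower page.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_bfs(
    seed: str,
    get_followers: Callable[[str], Iterable[str]],
    max_users: int,
) -> List[str]:
    """Visit accounts breadth-first from `seed`, stopping after `max_users`."""
    visited = {seed}        # accounts already queued, so none is visited twice
    order: List[str] = []   # accounts in the order they were crawled
    queue = deque([seed])
    while queue and len(order) < max_users:
        user = queue.popleft()
        order.append(user)
        for follower in get_followers(user):
            if follower not in visited:
                visited.add(follower)
                queue.append(follower)
    return order


# Toy in-memory follower graph standing in for scraped pages (made-up data).
toy_graph = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}

crawled = crawl_bfs("seed", lambda u: toy_graph.get(u, []), max_users=4)
print(crawled)
```

The `visited` set matters because follower graphs contain cycles (here `a` is followed by `seed`), and `max_users` caps the crawl in the same way our crawl stopped at a fixed number of users.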

Total number of users crawled: 2000
Total number of users manually annotated (each user was annotated by 3 people): 300 (ground truth)
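
With three annotators per user, the three labels have to be merged into one ground-truth label. The rule below is only one plausible way to do that (a strict majority vote that falls back to "ambiguous" when all three annotators disagree); the function name `combine_annotations` and the label strings are assumptions made for illustration, not a description of the exact rule we applied.

```python
from collections import Counter
from typing import Sequence


def combine_annotations(labels: Sequence[str]) -> str:
    """Merge several annotators' labels into one by strict majority vote."""
    if not labels:
        raise ValueError("at least one label is required")
    top_label, top_count = Counter(labels).most_common(1)[0]
    # A label wins only if more than half of the annotators chose it.
    if top_count * 2 > len(labels):
        return top_label
    # Assumed fallback: no majority means the account is treated as ambiguous.
    return "ambiguous"


print(combine_annotations(["fake", "fake", "genuine"]))
print(combine_annotations(["fake", "genuine", "ambiguous"]))
```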


We treated an account as a likely bot if:
    • Its follower-following ratio is skewed in one direction, taking the profile's date of creation into account.
    • It has no profile picture uploaded.
    • It has a pornographic picture uploaded.
    • It has fewer than α posts uploaded, where α is an arbitrary number chosen on the basis of our ground truth.
    • It repeats the same activity.
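
Some of these signals can be turned into numbers roughly as sketched here. Everything in this block is illustrative: the `Profile` fields, the default `alpha=3`, the +1 smoothing of the ratio, and the use of "share of digits in the username" as a stand-in for a suspicious username are assumptions, not the project's actual definitions. Picture content, creation date, and repeated activity are left out because they need data this sketch does not model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Profile:
    """Hypothetical container for the fields scraped from one profile page."""
    username: str
    followers: int
    following: int
    has_picture: bool
    track_count: int


def feature_vector(p: Profile, alpha: int = 3) -> List[float]:
    """Return [follower/following ratio, no-picture flag, few-tracks flag, digit share]."""
    # +1 on both sides avoids division by zero for accounts that follow nobody.
    ratio = (p.followers + 1) / (p.following + 1)
    no_picture = 0.0 if p.has_picture else 1.0
    few_tracks = 1.0 if p.track_count < alpha else 0.0
    # Assumed proxy for a "suspicious username": fraction of characters that are digits.
    digits = sum(ch.isdigit() for ch in p.username)
    digit_share = digits / len(p.username) if p.username else 0.0
    return [ratio, no_picture, few_tracks, digit_share]


suspect = Profile("user84730211", followers=2, following=1999,
                  has_picture=False, track_count=0)
vec = feature_vector(suspect)
print(vec)
```

A vector like this is what a classifier would receive for each annotated user; the threshold `alpha` would be tuned against the ground truth rather than fixed in advance.
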
Based on the feature vector above, we created the ground truth through manual annotation. We then trained our model using 3 supervised machine learning techniques: SVM, linear regression, and a decision tree classifier. We did not use neural networks because the ground truth dataset was too small.
We classified accounts into 3 categories:
  • Genuine users - Users who appear genuine and real
  • Ambiguous users - Users for whom we are not sure whether they are bots/fake
  • Fake users - Users who are bots/fake


Results
Figure 1: Graph representing the distribution of type of users in our study

Figure 2: Graph representing the distribution of type of profile pictures of fake profiles

Figure 3: Skewed follower-following ratio for fake profiles

Figure 4: Accuracy vs User count graph for 3 different methodologies
As shown in Figure 1, fake users/bots made up 14% of the accounts in our study, i.e. 280 of 2000. With the help of such bots, undeserving musicians are gaining the attention of a large number of followers. This share is too large to ignore.
The behaviour of our features is described in Figure 2 and Figure 3.
Profile picture turned out to be an important feature for detection: 57% of fake users have no picture uploaded, and 29% of fake users have a pornographic/inappropriate picture (Figure 2). Figure 3 also shows clearly that the follower-following ratio is marginally skewed in one direction for fake users compared to genuine musicians.
Finally, as shown in Figure 4, the accuracy of our methodology came out to be around 80% (highest for the decision tree classifier).
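
For reference, accuracy here means the fraction of users whose predicted category matches the annotated one. A minimal sketch of that computation follows; the function name and the five example labels are made up for illustration and are not our data.

```python
from typing import Sequence


def accuracy(true_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Fraction of users whose predicted category equals the annotated category."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one labelled user is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up example: 4 of 5 predictions agree with the annotation.
annotated = ["fake", "genuine", "genuine", "ambiguous", "fake"]
predicted = ["fake", "genuine", "fake", "ambiguous", "fake"]
score = accuracy(annotated, predicted)
print(score)
```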


Conclusion
The ML model, trained as explained above, was exposed through a Chrome extension. The extension warned users when they opened the page of a suspected fake user and gave the green light when they opened a genuine user’s profile. With this tool, users will be better informed. The data could also be used by SoundCloud employees when deciding whether a profile is fake.



Disclaimer: All images used in this blog have been created or captured by us, with the exception of the following picture, captured by Siddharth Arya.

The Team
From L-R: Rishi Mohan, Ishita Verma, Prachi Singh, Arushi Kumar, Akshat Sharda, Anisha Sejwal
