Department of Computer Science and Automation, Indian Institute of Science.
The event was declared open on March 1st in the CSA department, IISc, Bangalore, with a really engaging and motivating talk by Harikrishna. This talk served as a good starting point for the participants of this contest.
A Brief Tutorial
The following is a brief tutorial for the beginners in Machine Learning.
Solving the TwitMiner Challenge involves the following four basic parts:
- Feature Extraction
- Training using a classification algorithm
- Validation and parameter tuning
- Prediction on the test set
Feature Extraction
This is perhaps the most crucial step in the problem. A feature is a description of a data point (in the case of this challenge, each tweet is a data point). A feature vector is a vector that mathematically encodes the features. As a very simple example, the number of times the words "cricket" and "parliament" occur in a tweet can each represent a feature, and a 2-dimensional vector containing the counts of the two words in a particular tweet will be the corresponding feature vector for that tweet. You will need to come up with more effective features to represent tweets. Overall, by performing feature extraction, you will reduce all the tweets to vectors.
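The two-word counting scheme above can be sketched in Python as follows (the vocabulary and tweet here are made-up examples; a real submission would use a much richer feature set):

```python
def extract_features(tweet, vocabulary):
    """Map a tweet to a feature vector: one word-count per vocabulary entry."""
    words = tweet.lower().split()
    return [words.count(w) for w in vocabulary]

# Toy vocabulary of two words, as in the example above
vocabulary = ["cricket", "parliament"]

tweet = "Great cricket match today and cricket fans are thrilled"
print(extract_features(tweet, vocabulary))  # [2, 0]
```

Each tweet thus becomes a fixed-length numeric vector, which is the representation the classification algorithms in the next step expect.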
Training using a classification algorithm
In Machine Learning, datapoints are represented as tuples (feature vector, label). The label is typically a number that indicates the class or family of the datapoint. In the TwitMiner problem, each datapoint (tweet) can have the label "politics" (0) or "sports" (1). As there are only two possible labels, this is called a Binary Classification problem. Note that these labels are known in the training dataset, but unknown in the test dataset. It is assumed that there is a function that maps a feature vector to a label - our goal is to learn this function from the training set. For this purpose, many standard algorithms (called "classification algorithms") exist. For example, the algorithms available in the Weka Toolkit (a GUI-based software for Machine Learning) or those available with RapidMiner might be of use for this challenge; in fact, any publicly available tool can be used (you are also free to build your own classification algorithms tailored for this challenge).
Validation and Parameter Tuning
Once you have learned a function from the training set, you need to evaluate how well it behaves on new, unseen data. In other words, given a new datapoint (represented as a feature vector), can the learnt function map it to the correct label? The validation set is provided for this purpose. The labels of the validation set will not be provided, but on submission of the predicted labels, the accuracy (the percentage of tweets assigned the correct label by the learnt function) will be made known. If the accuracy is low, the features can be revised, or some other classification method can be used, with the hope of learning a better function.
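The accuracy measure reported by the contest is straightforward to compute yourself on any data for which you do know the labels (the example lists below are made up):

```python
def accuracy(predicted, actual):
    """Fraction of datapoints whose predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# 3 out of 4 predictions match the true labels
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

Holding out a labelled slice of the training data and checking accuracy on it in this way is a common trick for tuning features and parameters before spending a submission on the official validation set.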
Finally, when the test set of tweets (without labels) is available, you need to use the final learnt function to predict the label of each test tweet. You will be evaluated based on the accuracy achieved by your (final learnt) function on the test set.