Department of Computer Science and Automation, Indian Institute of Science.
The contest will be declared open during the Machine Learning event presentation on March 1st, 2013.
After the event has been declared open, the first two datasets will be available for download. There are three different datasets:
- Training data consisting of <tweetid, userid, category label>
- Validation data consisting of <tweetid, userid>. The participating team can upload the output labels produced by their algorithm for this validation data. The scores will be displayed immediately so that the team can evaluate its own progress.
- Test data consisting of <tweetid, userid>. This data will be published only on the day of final submission and not before that. Only the performance of the model on this test data will be used for judging winners.
The datasets will be available for download from the ‘Portal’ tab on this website. Along with the datasets, participants will find a script to download the tweet texts from Twitter.com. The instructions accompanying the script must be strictly followed. Participants need to use the tweet texts and the category labels from the training data to learn a classifier, and then use this classifier to produce outputs for the validation data and the test data. Note that USE OF EXTERNAL DATA SOURCES (OTHER THAN THE TWEET TEXT) WHILE MAKING PREDICTIONS ON THE VALIDATION / TEST DATA IS STRICTLY PROHIBITED. Also, you are NOT ALLOWED to use the user id / tweet id while learning a model or making predictions on the validation / test set. (Please note that the category label will be either ‘Politics’ or ‘Sports’, so this is a binary classification problem.)
Every submission of output labels in the correct format will be evaluated and assigned a score, and will be eligible for display on the leaderboard. Only the best score per team will be displayed on the leaderboard. The leaderboard will remain on display throughout the contest, and teams can upload output labels for the validation data as many times as they like while they keep improving their algorithm. On the final day of submission, the test data will be available for download. Participants need to submit the output labels for this test data, a writeup of their algorithm and the code they have used. Please look for instructions about the required submission formats in the ‘Portal’ tab.
Participants are required to upload only the output labels produced by their model, in the required format. These labels will be matched in the backend against the actual labels, and the classification accuracy (that is, the fraction of correct labels) will be shown. The team with the highest classification accuracy on the test data on the day of final submission will be adjudged the winner.
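The scoring rule described above can be sketched in a few lines. This is only an illustration of "fraction of correct labels"; the function name `classification_accuracy` and the sample label lists are assumptions made for the example, not the contest's actual backend code.

```python
def classification_accuracy(predicted, actual):
    """Return the fraction of positions where the predicted label matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot score an empty submission")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels, made up for illustration only.
predicted = ["Politics", "Sports", "Sports", "Politics"]
actual = ["Politics", "Sports", "Politics", "Politics"]

score = classification_accuracy(predicted, actual)  # 3 of the 4 labels match
print(score)
```

A team could run a check like this on a held-out part of the training data before uploading labels for the validation set.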
The data can be expected to be noisy (some tweet texts may have incorrect labels). Hence, preprocessing the training data and feature extraction will be critical to obtaining high accuracy on the validation data and the test data.
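As one possible illustration of the preprocessing step, the sketch below normalises a tweet and splits it into word tokens using only the tweet text. The function name `tokenize_tweet`, the choice to drop URLs and @-mentions, and the sample tweet are all assumptions made for this example; other cleaning choices may work better on the actual data.

```python
import re

URL_RE = re.compile(r"https?://\S+")      # links carry little topical signal
MENTION_RE = re.compile(r"@\w+")          # @-mentions of other accounts
TOKEN_RE = re.compile(r"[a-z0-9']+")      # runs of letters, digits and apostrophes


def tokenize_tweet(text):
    """Lowercase a tweet, drop URLs and mentions, and return its word tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")         # keep the hashtag word, drop the symbol
    return TOKEN_RE.findall(text)


# Hypothetical tweet, made up for illustration only.
tokens = tokenize_tweet("Great win by @TeamIndia today! #cricket http://t.co/abc123")
print(tokens)
```

Keeping the word behind a hashtag while discarding the `#` symbol lets `#cricket` and `cricket` count as the same feature.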
Feature representation is another crucial part of the problem, since it can determine how well the classifier is trained and hence how well it performs.
The choice of classifier will also lead to varying performance on the validation data and the test data.
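To make the roles of feature representation and classifier choice concrete, here is a minimal sketch of one common baseline: a bag-of-words representation with a multinomial Naive Bayes classifier and add-one smoothing. It is not a recommended or official approach, only an example of the kind of model a team might start from. The class name, the whitespace tokenisation and the toy training sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over bag-of-words counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict_one(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.label_counts[label] / total_docs)
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # words never seen in training are ignored
                score += math.log(
                    (counts[token] + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, texts):
        return [self.predict_one(text) for text in texts]


# Toy data, made up for illustration only.
train_texts = [
    "parliament passes the budget bill",
    "minister addresses election rally",
    "team wins the cricket match",
    "striker scores in the football final",
]
train_labels = ["Politics", "Politics", "Sports", "Sports"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
predictions = model.predict([
    "election results announced by the minister",
    "cricket team wins the final",
])
print(predictions)
```

Swapping the tokenisation, the smoothing value or the classifier itself changes the predictions, which is why these choices are worth comparing on the validation data.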
- Every team can have a maximum of 2 members only.
- The team must register at any time during the event with valid email ids of its members and a team name (this will be used for display on the leaderboard). Please note that email ids will not be disclosed online. However, the names of the winning team's members will be disclosed after the event has ended.
- No participant is allowed to participate in multiple teams. If a person participates in multiple teams, then those teams will be disqualified.
- After registration, no change in team composition is allowed.
- The event will be open for two weeks, from 1st March to 15th March 11:59pm IST (GMT+5:30).
- The test data will be available from 15th March onwards. The final submission has to be made before 15th March 11:59pm IST (GMT+5:30).
- Trial submissions (that is, uploading labels for the validation data) will continue until the end of the event. For the final submission on March 15th, multiple submissions can be made, but only the latest will be considered. Please follow the ‘Dates’ and the ‘Portal’ tabs for updates.
- To be eligible for a prize, the participating team should upload a writeup (maximum 2 pages, describing the algorithm and implementation) as well as fully working code with proper documentation during the final submission. The details about the submission format will be provided with the dataset in the ‘Portal’ tab.
- Please note that classification accuracy alone will not be used for judging winners. The writeup and the properly documented code will also be used to decide the winners.