evaluation complexity increases when:
*the task is more subjective
*you have more coders
*you have more categories
*you have more data to code
chance-adjusted agreement metrics (as opposed to percentage agreement) typically aim to account for these factors in their calculation (though there’s no easy way to quantify “subjectivity” except after the calculation; a low score can be indicative of the subjective nature of the evaluation task).
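
to make the difference between the two kinds of metric concrete, here is a minimal sketch comparing percentage agreement with Cohen's kappa, one common chance-adjusted metric for exactly two coders. The function names and the toy labels are made up for illustration and are not from any particular library; with more than two coders a different metric (for example Fleiss' kappa or Krippendorff's alpha) would be needed.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two coders gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two coders: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected chance agreement, from each coder's own label frequencies.
    p_e = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        # Assumption for this sketch: both coders used one identical label
        # throughout, so kappa is undefined; report perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: two coders labelling the same ten items.
coder_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
coder_2 = ["pos", "neg", "neg", "neg", "pos", "pos", "pos", "pos", "neg", "pos"]

print(f"percent agreement: {percent_agreement(coder_1, coder_2):.2f}")  # 0.80
print(f"cohen's kappa:     {cohen_kappa(coder_1, coder_2):.2f}")        # 0.58
```

the coders agree on 8 of 10 items, but with two categories used at a 6:4 split by both coders, 0.52 agreement would be expected by chance alone, so kappa comes out noticeably lower than the raw percentage.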