Using the first 10000 examples of each dataset (6192 for Kin, which is smaller), we trained models with various values of q, ranging from 2 to 100. We used a fixed cache size of 100 MB and enabled shrinking, but did not use the sparse mode. The subproblems of size q > 2 were solved with a conjugate gradient method with projection [11]. Table 2 reports the results of these experiments. It is clear that q = 2 is always faster than any other value of q; we therefore used q = 2 in all subsequent experiments.
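To illustrate the kind of subproblem the inner optimizer faces, the following sketch solves a box-constrained quadratic program over a working set of q variables. Note that this is a simplified projected-gradient variant, not the conjugate gradient method with projection used in the experiments; the function name, step count, and learning rate are illustrative choices, not part of the original setup.

```python
# Sketch of a solver for the box-constrained QP subproblem
#   min 1/2 a^T Q a - p^T a   s.t. 0 <= a_i <= C,
# which arises when optimizing a working set of size q.
# Simplified projected gradient, for illustration only.

def solve_subproblem(Q, p, C, steps=200, lr=0.1):
    n = len(p)
    a = [0.0] * n
    for _ in range(steps):
        # gradient of the quadratic objective: Q a - p
        g = [sum(Q[i][j] * a[j] for j in range(n)) - p[i]
             for i in range(n)]
        # gradient step followed by projection onto the box [0, C]^n
        a = [min(C, max(0.0, a[i] - lr * g[i])) for i in range(n)]
    return a

# Tiny 2-variable example: the unconstrained optimum (0.5, 2.5)
# is clipped by the box constraint to (0.5, 1.0).
Q = [[2.0, 0.0], [0.0, 2.0]]
p = [1.0, 5.0]
print(solve_subproblem(Q, p, C=1.0))
```

For q = 2, this subproblem admits a closed-form solution, which is one reason a small working set can be faster in practice: each subproblem is solved exactly and cheaply.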