The Lakh MIDI Dataset v0.1 (colinraffel.com) is a large collection of MIDI files that was assembled to be matched with the Million Song Dataset. A subset of this dataset, the lakh clean dataset, contains more than 17,000 MIDI files. Most of these files contain a percussion accompaniment. The goal is to develop methods for analyzing the characteristics of the percussion line.
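The percussion part plays an important role in a MIDI file. Channel 9 (counting from 0; channel 10 in one-based numbering) is used for encoding the percussion line. On this channel the pitch data byte of the note-on message selects the percussion instrument rather than a pitch, so up to 128 instruments can be addressed; General MIDI Level 1 assigns percussion sounds to keys 35 through 81.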
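As a rough illustration of this channel convention, the sketch below picks percussion note-on events out of raw three-byte MIDI channel messages. The function name `is_percussion_note_on` and the sample messages are made up for this example, and the code assumes zero-based channel numbers, so that the note-on status byte of the percussion channel is 0x99.

```python
PERCUSSION_CHANNEL = 9  # zero-based; channel 10 in one-based numbering


def is_percussion_note_on(status: int, pitch: int, velocity: int) -> bool:
    """Return True for a sounding note-on message on the percussion channel.

    A note-on with velocity 0 is conventionally treated as a note-off,
    so it is not counted as an onset. The pitch is accepted only so the
    function can be called with a whole (status, pitch, velocity) message.
    """
    is_note_on = (status & 0xF0) == 0x90  # upper nibble: message type
    channel = status & 0x0F               # lower nibble: channel number
    return is_note_on and channel == PERCUSSION_CHANNEL and velocity > 0


# Hypothetical (status, pitch, velocity) messages, for illustration only.
messages = [
    (0x99, 36, 100),  # note-on, channel 9, key 36
    (0x90, 60, 90),   # note-on, channel 0 (a melodic part)
    (0x99, 38, 0),    # velocity 0: acts as a note-off
    (0x89, 36, 0),    # note-off, channel 9
]
percussion_pitches = [p for s, p, v in messages if is_percussion_note_on(s, p, v)]
print(percussion_pitches)
```

In practice a MIDI parsing library would deliver these fields already decoded; the bit masks are shown only to make the role of channel 9 explicit.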
Though there may be 10 or more different percussion instruments playing at a time, the main structure of the part is dictated by three instruments: the bass drum, the snare drum, and the hi-hat or equivalents. There are several bass drums, and any one of them can play this role. Similarly, either the acoustic snare or the electric snare can be used. In some cases the side stick or hand clap replaces the snare drum. Finally, the hi-hat is used to mark time. In this study, we will ignore the hi-hat or its equivalent, since it does not vary much.
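The grouping of instruments into roles can be sketched as a lookup on the percussion key number. The key numbers below are those of the General MIDI Level 1 percussion map; the set names, the function `drum_role`, and the `allow_substitutes` switch are assumptions made for this illustration and are not taken from the analysis software.

```python
from typing import Optional

# General MIDI Level 1 percussion keys for the instruments named above.
BASS_DRUM_NOTES = {35, 36}         # Acoustic Bass Drum, Bass Drum 1
SNARE_NOTES = {38, 40}             # Acoustic Snare, Electric Snare
SNARE_SUBSTITUTE_NOTES = {37, 39}  # Side Stick, Hand Clap
HI_HAT_NOTES = {42, 44, 46}        # Closed, Pedal, Open Hi-Hat


def drum_role(pitch: int, allow_substitutes: bool = True) -> Optional[str]:
    """Map a percussion key number to 'bass', 'snare', 'hihat', or None."""
    if pitch in BASS_DRUM_NOTES:
        return "bass"
    if pitch in SNARE_NOTES:
        return "snare"
    if allow_substitutes and pitch in SNARE_SUBSTITUTE_NOTES:
        return "snare"
    if pitch in HI_HAT_NOTES:
        return "hihat"
    return None  # any other percussion instrument is ignored here


print(drum_role(36), drum_role(40), drum_role(39), drum_role(49))
```

Treating the side stick and hand clap as snare substitutes is optional here because they only take over that role in some files.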
Percussion manuals focus on teaching various grooves or drum loops: repeating patterns of one or more measures. See FamousDrumBeats, 25 Practical Rock Grooves, 20 Beats and Grooves, or Basic Drum Patterns for PDF files on this subject. There are also numerous instructional YouTube videos, for example How to Write Drum Parts (for non drummers).
The majority of MIDI files follow this structure; however, there are many exceptions. For example, Latin music relies on the conga and bongo drums to produce complex rhythms. To simplify the analysis in this section, we restrict ourselves to the class of percussion parts described above.
Here is a sample of a percussion part.

This is a drum map representation produced by the midiexplorer application for a small section of the MIDI file Bill Joel/You May be Right.1.mid. The beat number is indicated on the horizontal scale, and the percussion onsets are indicated by the blue, red, and green vertical line segments.
This extract sounds like this.
It is convenient to assign names such as motown or 4on4 to the various grooves that were identified in the above documents. Unfortunately, many of the grooves that we encountered could not be matched to one of the common grooves.
In order to perform this analysis, it is necessary to encode each groove into a code word, as explained here. We shall assume that a groove fits in one bar consisting of 4 beats. There are of course many exceptions to this rule, but we can account for them later. A four-beat groove is represented by a 32-bit integer split into four 8-bit bytes, one per beat. Each byte encodes the positions of the note onsets of the snare drum (upper 4 bits) and the bass drum (lower 4 bits). For example, the above groove would be represented by the code word 8:82:8:82 (written 08:82:08:82 with leading zeros). We use the hexadecimal number system, where 10, 11, 12, 13, 14, and 15 are represented by the letters a, b, c, d, e, and f. The binary representation of hexadecimal 82 is 1000 0010, where a space separates the two 4-bit groups. The first group, 1000, indicates a snare onset occurring at the beginning of the beat. The second group, 0010, indicates a bass drum onset occurring in the middle of the same beat. In general, the binary representation gives the position of each drum onset relative to the start of the beat in 1/16 time units. Here are the codes of some of the grooves named in common drum beats, followed by their binary representations.
| code | snare drum | bass drum | groove name |
|---|---|---|---|
| 08:02:08:02 | 0000 0000 0000 0000 | 1000 0010 1000 0010 | bossanova |
| 08:08:08:08 | 0000 0000 0000 0000 | 1000 1000 1000 1000 | 4onFloor |
| 08:80:08:80 | 0000 1000 0000 1000 | 1000 0000 1000 0000 | 4on4 |
| 08:82:02:82 | 0000 1000 0000 1000 | 1000 0010 0010 0010 | bluesRock |
| 08:88:08:88 | 0000 1000 0000 1000 | 1000 1000 1000 1000 | disco |
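The encoding can be sketched in a few lines. The function names below are hypothetical, onset positions are given as sixteenth-note indices 0 to 3 within a beat, and placing the first beat in the most significant byte of the 32-bit integer is an assumption, since the byte order is not stated above.

```python
def encode_beat(snare_positions, bass_positions) -> int:
    """Pack the onset positions (0..3) of one beat into a byte.

    Position 0 maps to bit 1000, position 1 to 0100, and so on.
    The snare occupies the upper 4 bits, the bass drum the lower 4 bits.
    """
    snare = 0
    for pos in snare_positions:
        snare |= 0b1000 >> pos
    bass = 0
    for pos in bass_positions:
        bass |= 0b1000 >> pos
    return (snare << 4) | bass


def encode_groove(beats) -> str:
    """beats: one (snare_positions, bass_positions) pair per beat."""
    return ":".join(f"{encode_beat(s, b):02x}" for s, b in beats)


def decode_groove(code: str):
    """Return the snare and bass bit strings in the layout of the table."""
    values = [int(part, 16) for part in code.split(":")]
    snare = " ".join(f"{v >> 4:04b}" for v in values)
    bass = " ".join(f"{v & 0x0F:04b}" for v in values)
    return snare, bass


def pack_groove(code: str) -> int:
    """Pack a code word into one integer (assumed order: first beat highest)."""
    packed = 0
    for part in code.split(":"):
        packed = (packed << 8) | int(part, 16)
    return packed


# Bass on beats 1 and 3; snare at the start and bass in the middle of beats 2 and 4.
example = [([], [0]), ([0], [2]), ([], [0]), ([0], [2])]
print(encode_groove(example))
print(decode_groove("08:88:08:88"))
print(hex(pack_groove("08:80:08:80")))
```

Decoding a code word this way is also a quick check that the binary columns of the table agree with the code column.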
The analysis of the lakh clean dataset detected almost 80,000 distinct four-beat grooves. Here are the first few, starting with the most frequent.
| code | total frequency | file frequency |
|---|---|---|
| 08:80:08:80 | 113995 | 3023 |
| 80:08:80 | 93907 | 20 |
| 08:88:08:88 | 78062 | 1385 |
| 08:82:08:80 | 68833 | 2197 |
| 08:88:08 | 60603 | 6 |
| 88:08:88 | 54200 | 2 |
| 08:82:08 | 54093 | 10 |
| 08:08:08 | 45890 | 41 |
| 82:08:80 | 37438 | 11 |
| 80:08:82 | 36252 | 7 |
| 08:82:08:82 | 31799 | 1453 |
| 08:08:08:08 | 31674 | 1762 |
The total frequency column indicates the total number of times the groove with this code was encountered in the dataset. The file frequency column indicates the number of MIDI files in which this groove is present. Many of these grooves do not appear to have been assigned a name.
Most of these grooves are rare. If we order the grooves by decreasing frequency and then plot the frequency of each groove against its rank in the list, we obtain the following plot. Note that the vertical scale is logarithmic.
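The two counts can be accumulated as follows. This is only a sketch: the function name `groove_frequencies`, the input layout (a dictionary mapping a file name to the list of groove codes found in it), and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def groove_frequencies(grooves_by_file):
    """Return (total_frequency, file_frequency) counters keyed by groove code."""
    total_frequency = Counter()
    file_frequency = Counter()
    for grooves in grooves_by_file.values():
        total_frequency.update(grooves)      # every occurrence counts
        file_frequency.update(set(grooves))  # each file counts at most once
    return total_frequency, file_frequency


# Hypothetical input, not taken from the dataset.
grooves_by_file = {
    "a.mid": ["08:80:08:80", "08:80:08:80", "08:88:08:88"],
    "b.mid": ["08:80:08:80", "08:82:08:80"],
}
total, files = groove_frequencies(grooves_by_file)

# most_common() lists the codes by decreasing total frequency.
for code, count in total.most_common():
    print(code, count, files[code])
```

The position of a code in the `most_common()` list is its rank when the grooves are ordered by decreasing frequency.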

The plot shows a long tail. Beyond rank 10,000, a groove occurs fewer than 10 times in the entire dataset. These grooves are probably outliers. The onset times of the notes in some of the MIDI files may be unquantized or dithered, causing some unusual groove patterns. Thus a groove such as 8:130:2:128 (beat values written in decimal, that is 08:82:02:80 in hexadecimal) would be split into the 4 beats 8, 130, 2, and 128, and we would extract the distribution of these beats in the MIDI file and compare this distribution with those of other MIDI files.
Some groove patterns may fill two measures, which would appear as two one-bar grooves occurring in alternate bars. There may also be minor changes in a groove that occur at random places; these variations would appear as additional drum patterns in the MIDI file. A typical MIDI file may have 10 or more grooves, of which some predominate and others occur less frequently. In order to compare drum patterns, it may be better to split the grooves into single beats and look at the distribution of the beat patterns.
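A minimal sketch of this beat-level comparison follows. The splitting step is as described above; the distance measure, total variation distance between the two normalized beat histograms, is our own choice for the illustration and is not necessarily the comparison used in the analysis. The function names are hypothetical.

```python
from collections import Counter


def beat_distribution(grooves):
    """Split hexadecimal groove codes into beats and return relative frequencies."""
    counts = Counter()
    for code in grooves:
        counts.update(int(part, 16) for part in code.split(":"))
    n = sum(counts.values())
    return {beat: c / n for beat, c in counts.items()} if n else {}


def total_variation(p, q):
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    beats = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in beats)


# 08:82:02:80 is the groove 8:130:2:128 with the beat values written in hexadecimal.
file_a = beat_distribution(["08:82:02:80"])
file_b = beat_distribution(["08:80:08:80"])
print(file_a)
print(total_variation(file_a, file_b))
```

Because each beat is counted on its own, a two-bar groove or an occasional variation changes the distribution only slightly instead of creating a new groove code.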

Here are the first 16 beat patterns, ordered by decreasing frequency.
| code | snare | bass | frequency |
|---|---|---|---|
| 08 | 0000 | 1000 | 1520680 |
| 80 | 1000 | 0000 | 821905 |
| 82 | 1000 | 0010 | 438892 |
| 88 | 1000 | 1000 | 297418 |
| 0a | 0000 | 1010 | 280207 |
| 02 | 0000 | 0010 | 264310 |
| 01 | 0000 | 0001 | 138394 |
| 10 | 0001 | 0000 | 105700 |
| 28 | 0010 | 1000 | 96625 |
| 81 | 1000 | 0001 | 92972 |
| 09 | 0000 | 1001 | 80646 |
| 20 | 0010 | 0000 | 77682 |
| a0 | 1010 | 0000 | 62570 |
| 18 | 0001 | 1000 | 52606 |