A divider is a logic module that takes two binary numbers and produces their numerical quotient (and optionally, the remainder). The basic structure is a series of subtractions and multiplexers, where the multiplexer uses the result of the subtraction to select the value that gets passed to the next step. The quotient is formed from the bits used to control the multiplexers and the remainder is the result of the last subtraction.
If it is implemented purely combinatorially, then the critical path through all of this logic is quite long (even with carry-lookahead in the subtractors) and the clock cycle must be very slow. What could be done to shorten the clock period without losing the ability to get a result on every clock?
Pretty much any large chunk of combinatorial logic can be pipelined to reduce the clock period. This enables it to produce more results in a given amount of time at the expense of increasing the latency for any particular result.
Divider logic is very easy to pipeline and the number of pipeline stages you can use is fairly arbitrary. You could insert a pipeline register after each subtract-mux pair, or you may choose to do two or more subtract-mux stages per pipeline register You could even go so far as to pipeline the subtracts and the muxes separately (or even pipeline within each subtract) to get the fastest possible clock speed, but this would be rather extreme. The more pipeline registers you use, the shorter the critical path (and the clock period) can be, but you use more resources (the registers). Also, the overall latency goes up, since you need to account for the setup and propagation times of the pipeline registers in the clock period (in addition to the subtract-mux logic delays). This gets multiplied by the number of pipeline stages to compute the total latency.