Removing the colour filters reduces light loss, but that does not change dynamic range, because dynamic range is the ratio of full well capacity to the noise floor (dark noise plus read noise); in stops, it is the difference between the two. If the highlights are not clipped, you can use the light that is no longer lost to the colour filters to increase exposure, and then you get less noise in the shadows just above black. But you could do the same thing by using a larger aperture or a slower shutter speed. Conversely, if the highlights are already clipped, and you care about that, you have to stop down or use a faster shutter speed to compensate for the extra light, and you don't get the reduced shadow noise. So the conditions under which the extra light is useful are limited.
I was referring to photographing people indoors in available light, where the scarcity of photons and the color spectrum of the light sources are the key limits on image quality. You cannot use a larger aperture if you are already wide open, and you cannot slow the shutter if it is already borderline, where the slightest subject movement would cause significant blur. This is very common at indoor or night-time events. Highlight clipping is not a priority if the subjects' faces are excessively noisy for lack of photons. Reducing the attenuation due to the CFA would alleviate the situation somewhat, arguably significantly. Faster lenses (such as f/0.95 or f/1.2) bring their own problems: depth of field that is often too shallow, weight, high cost, and no zoom, which is a serious limitation in many types of event photography. I've sometimes had to go as high as ISO 102400 to register an image at f/2.8 (granted, it was deep inside medieval castle ruins, but there were some artificial lights used by the dancers). In wedding photography I am routinely at ISO 25600; I'd much rather have the image quality of ISO 6400 or 12800 while keeping the aperture and shutter speed I get at ISO 25600. Despite the high sensitivity of cameras such as the D6, there are still many situations where there is not enough light for a good enough image, and which could be improved.
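To put rough numbers on that ISO wish, here is a minimal Python sketch. It assumes photon shot noise dominates, so SNR scales with the square root of the light collected; read noise and dark noise are ignored, and the function names are just illustrative. It says nothing about how much light a real CFA actually blocks.

```python
import math

def stops_between(iso_high, iso_low):
    """Exposure difference in stops between two ISO settings."""
    return math.log2(iso_high / iso_low)

def shot_noise_snr_gain(light_gain):
    """SNR improvement factor when the photon count is multiplied by
    light_gain, assuming shot noise dominates (SNR ~ sqrt(photons))."""
    return math.sqrt(light_gain)

# Same aperture and shutter speed as at ISO 25600, but with enough extra
# light reaching the sensor to work at a lower ISO instead.
for iso_low in (12800, 6400):
    stops = stops_between(25600, iso_low)
    gain = 2 ** stops
    print(f"ISO 25600 -> {iso_low}: {stops:.0f} stop(s), "
          f"{gain:.0f}x light, shot-noise SNR x{shot_noise_snr_gain(gain):.2f}")
```

So the difference I am asking for is one to two stops, which under this simplified model is roughly a 1.4x to 2x improvement in SNR.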
The increase in resolution with a monochrome sensor is small.
I agree, but if file size and processing time are considered significant limitations, the designers could triple the pixel count of a monochrome sensor while keeping the same file size as a color sensor camera, with no significant penalty in storage or processing time (raw conversion would be significantly faster than today, which is another advantage). The image resolution would then be greater.
One reason is that the Bayer process mirrors human vision: our brains use luminance for detail and add a rather crude colour map on top of that. If you look at the resolutions (MTF50) measured by photographing test targets and compare them to the Nyquist limit, the gap is the maximum benefit you can get, and with current equipment it is remarkably small. Then there is the fact that ink jet printers are limited to 260 or 300 dpi, about 10 or 12 dots per mm, so the most they can print is 5-6 lp/mm. If that is achievable with a Bayer sensor and a good lens, you don't get any benefit from a monochrome sensor's higher resolution in the print.
Yet resampling from an 8K sensor to a 4K final image gives a noticeable improvement in detail over a native 4K sensor; this is very obvious when inspecting video from native 4K and 8K cameras at 4K resolution. It wouldn't happen in the absence of the CFA and a possible AA filter. Of course, if aliasing is a problem, then I refer to the above proposal of increasing the monochrome sensor pixel count by a factor of three to reduce the effects of aliasing while retaining the original file size.
Where is your inkjet limitation coming from? My Epson P900 prints at 5760 x 1440 dpi. Some landscape photographers using medium format have suggested that improvements are visible up to about 1000 ppi when printing. Others can't see much due to aging vision or otherwise poor eyesight. Personally, since I don't use glossy paper, I haven't gotten into such comparisons; matte and semiglossy papers don't show details as fine as glossy does. In any case, if the print is large (or the image needs to be reframed by cropping), there is plenty of opportunity to benefit from higher resolution and MTF. (For landscape, I wouldn't personally be using a black and white sensor.)
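For reference, the conversion behind these print figures can be sketched in a few lines of Python. It treats each number as image pixels per inch and takes one line pair as two pixels (the Nyquist limit); it does not model how a printer's droplet-placement dpi relates to the detail it can actually resolve, and the function name is only illustrative.

```python
MM_PER_INCH = 25.4

def line_pairs_per_mm(pixels_per_inch):
    """Nyquist-limited line pairs per mm: one line pair needs two pixels."""
    return pixels_per_inch / MM_PER_INCH / 2

# 260 and 300 are the figures quoted above; 1000 is the ppi value
# some medium format landscape photographers have suggested.
for ppi in (260, 300, 1000):
    print(f"{ppi} ppi -> {ppi / MM_PER_INCH:.1f} px/mm "
          f"-> {line_pairs_per_mm(ppi):.1f} lp/mm")
```

At 260 to 300 ppi this gives the 5-6 lp/mm quoted above; at 1000 ppi the ceiling would be nearly 20 lp/mm, so which figure applies matters a great deal to the argument.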