Multi-scale architectures at a granular level are characterised by separating the input features into groups and applying multi-scale feature extraction to each group; as a result, the correlations among the input features, i.e. the global information, are no longer retained. Moreover, the separation usually demands more input features, and therefore introduces additional complexity. To retain the global information while keeping the advantages of feature-level hierarchical multi-scale architectures, we propose a multi-scale aggregated-dilation (MSAD) architecture that performs hierarchical fusion of features at the layer level, integrating dilated convolutions to overcome these issues. To evaluate the model, we integrate it into ResNet and apply it to a unique dataset containing over 60,000 fluorescence lifetime endomicroscopy (FLIM) images, collected by a custom fibre-based FLIM system on ex-vivo normal and cancerous lung tissues from 14 patients. Performance is measured by accuracy, precision, recall, and AUC. We first compare our MSAD model with eight networks, outperforming them by over 6%. To illustrate the advantages and disadvantages of multi-scale architectures at the layer and feature levels, we thoroughly compare our MSAD model with the state-of-the-art feature-level multi-scale network, Res2Net, in terms of parameters, scales, and effective convolutions.
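As background for the role of dilated convolutions here: a dilated convolution enlarges the receptive field without adding parameters, since a kernel of size k with dilation rate d covers an effective extent of k + (k - 1)(d - 1). A minimal illustrative sketch (not the authors' implementation; the function name is our own):

```python
def effective_kernel_size(k: int, d: int) -> int:
    """Effective spatial extent of a dilated convolution kernel:
    k taps spread out with (d - 1) gaps between adjacent taps."""
    return k + (k - 1) * (d - 1)

# A 3x3 kernel at increasing dilation rates covers progressively
# larger receptive fields at the same parameter cost:
for d in (1, 2, 4):
    print(f"dilation {d}: effective size {effective_kernel_size(3, d)}")
# dilation 1: effective size 3
# dilation 2: effective size 5
# dilation 4: effective size 9
```

This is why aggregating branches with different dilation rates yields multi-scale context at the layer level without splitting the input features into groups.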