Automatic speech recognition (ASR) for Vietnamese still suffers severe accuracy loss in everyday acoustic scenes crowded with overlapping speech, popular music, traffic, and office equipment. We address this gap by explicitly modeling noise during acoustic-model training. A 200-hour clean-speech corpus is augmented with 2000 hours of authentic Vietnamese noise recordings, mixed at six signal-to-noise ratios between 0 and 15 dB. Noise-only regions are tagged in the transcripts, allowing the network to learn dedicated “noise phones”. A TDNN-LSTM trained with lattice-free MMI on this corpus reduces the word error rate by up to 39% on five noisy benchmarks, while preserving, and often improving, accuracy on clean speech. These findings confirm that explicit noise tags are a practical step toward robust Vietnamese ASR in consumer and enterprise products. Future work will refine automatic noise segmentation and couple the approach with stronger language models to further improve end-to-end accuracy.
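The SNR-controlled mixing step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual augmentation pipeline; the helper name `mix_at_snr` and all implementation details are assumptions:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR.

    Illustrative sketch: real pipelines also handle resampling, clipping,
    and per-utterance noise selection.
    """
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    # Average power of each signal.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)

    # Solve snr_db = 10 * log10(p_speech / (gain^2 * p_noise)) for gain.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Sampling `snr_db` uniformly from the six target levels in the 0–15 dB range then yields the multi-condition training mixtures.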