No problem asking, though I’m not sure I fully understand what you did — there are so many possible measurements and comparisons one can make with speech. What I would say is that your approach sounds very similar to the way I worked until three or four years ago; sorry I can only tell you in general terms.
Around 2016/17, I was working on a project in which the whole team agreed that choosing the right features to extract would be the most important decision, i.e. once we had worked that out, we would be able to write hyper-optimised algorithms to extract and process the features needed for some niche application (e.g. recognition, accent morphing, pronunciation training). We even started using neural nets to help decide which features to extract.
However, when the penny dropped, after watching a few YouTube clips and reading a few recent papers (most about image processing rather than audio), we realised that if our network could work out which features to extract, then it could be modified to do the actual processing as well. Think about the way image research has moved so quickly from face recognition, to face modification and deep fakes, and most recently to “inventing” photo-realistic faces of people who have never existed.
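To make that penny-drop concrete, here is a minimal sketch of the idea (my own toy construction, not anything from the project described above): instead of hand-picking a filter bank as the "features", the filter bank is just another weight matrix, and the classification error is back-propagated straight into it. The task (telling two tone frequencies apart) and every name in the code are illustrative assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

def make_tone(freq, n=128, sr=1000.0):
    """A noisy sine wave: a stand-in for a short speech frame."""
    t = np.arange(n) / sr
    return np.sin(2 * np.pi * freq * t) + 0.1 * rng.standard_normal(n)

# Toy task: tell 50 Hz "sounds" from 120 Hz "sounds".
X = np.stack([make_tone(50.0) for _ in range(16)] +
             [make_tone(120.0) for _ in range(16)])
y = np.array([0.0] * 16 + [1.0] * 16)

K, L = 4, 32                              # filters in the bank, samples per filter
H = 0.1 * rng.standard_normal((K, L))     # the filter bank IS the feature extractor,
w = np.zeros(K); b = 0.0                  # trained like any other weight
lr = 0.1

def extract(x, H):
    win = sliding_window_view(x, L)       # (T, L) sliding windows over the signal
    resp = win @ H.T                      # (T, K) filter responses (correlation)
    return (resp ** 2).mean(axis=0), resp, win   # mean energy per filter

losses = []
for epoch in range(60):
    total = 0.0
    for x, target in zip(X, y):
        f, resp, win = extract(x, H)
        p = 1.0 / (1.0 + np.exp(-np.clip(w @ f + b, -30.0, 30.0)))
        total += -(target * np.log(p + 1e-9) + (1 - target) * np.log(1 - p + 1e-9))
        g = p - target                    # dLoss/dlogit of the logistic head
        T = resp.shape[0]
        # Chain rule: the error flows through the head INTO the filters, so the
        # network decides for itself what the "features" should be.
        dH = (g * w)[:, None] * (2.0 / T) * (resp.T @ win)
        w -= lr * g * f; b -= lr * g; H -= lr * dH
    losses.append(total / len(X))
```

After training, `losses` falls from roughly chance-level cross-entropy, and `H` has drifted from random noise toward filters tuned to the task — the same mechanism, scaled up, that lets one network both pick the features and do the processing.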
Getting my head around this move from using neural nets as an analysis tool to using them to generate the output has been one of the most enlightening events of my career. There is always something new to learn. The really amazing thing, though, is that unlike 20 years ago, you can download, retrain and then redesign work that other people are sharing, and build working systems long before you’re ready to design your own network.
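The "download, retrain" workflow can be sketched in miniature (again my own toy illustration, with invented names and data, not the poster's actual setup): stage 1 stands in for pretraining that someone else published — here, learning an encoder by reconstruction on unlabelled signals — and stage 2 freezes that encoder and trains only a small new head for your own task.

```python
import numpy as np

rng = np.random.default_rng(1)

def tone(freq, n=128, sr=1000.0):
    t = np.arange(n) / sr
    return np.sin(2 * np.pi * freq * t) + 0.1 * rng.standard_normal(n)

# Unlabelled audio for "pretraining"; labelled audio for the downstream task.
unlabelled = np.stack([tone(f) for f in rng.choice([50.0, 120.0], size=64)])
X = np.stack([tone(50.0) for _ in range(16)] + [tone(120.0) for _ in range(16)])
y = np.array([0.0] * 16 + [1.0] * 16)

n, k = 128, 4
E = 0.1 * rng.standard_normal((k, n))   # encoder ("downloaded" weights in real life)
D = 0.1 * rng.standard_normal((n, k))   # decoder, only needed during pretraining

# Stage 1: pretrain by reconstruction (a stand-in for training someone else did).
lr = 5e-4
for _ in range(100):
    for x in unlabelled:
        f = E @ x
        err = x - D @ f
        D += lr * 2.0 * np.outer(err, f)
        E += lr * 2.0 * np.outer(D.T @ err, x)

# Stage 2: freeze E, bolt on a brand-new classifier head and train only that.
w = np.zeros(k); b = 0.0
head_losses = []
for _ in range(200):
    total = 0.0
    for x, target in zip(X, y):
        f = E @ x                        # frozen, reused features
        p = 1.0 / (1.0 + np.exp(-np.clip(w @ f + b, -30.0, 30.0)))
        total += -(target * np.log(p + 1e-9) + (1 - target) * np.log(1 - p + 1e-9))
        g = p - target
        w -= 0.01 * g * f; b -= 0.01 * g
    head_losses.append(total / len(X))
```

The point of the sketch is the division of labour: someone else's training run produces `E`, and you get a working system by fitting only `w` and `b` — long before you could have designed or trained the whole network yourself.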