About a year ago I was working with several thousand generated WAV files. I was trying to tag them, sort them into folders, and create metadata. Along the way I listened to a few of them and found, to my chagrin, that they all began with a rather long silence. This was very annoying, especially when listening to a series of files in a row and stumbling over a pause before each one. Great, so that was one more thing I had to deal with.
I had already spent some time looking for tools to remove silence from files when it suddenly dawned on me: this is WAV! The data in WAV files is usually PCM audio, that is, each value in the file specifies the amplitude of the sound at some point in time. So if the files really contain complete silence, and not white noise, then that silence should show up in the file as a solid run of zeros, right?
$ xxd testfile1.wav | head -n 100
00000000: 5249 4646 64b9 0e00 5741 5645 666d 7420 RIFFd...WAVEfmt
00000010: 1000 0000 0100 0200 44ac 0000 10b1 0200 ........D.......
00000020: 0400 1000 6461 7461 40b9 0e00 0000 0000 ....data@.......
00000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000 ................
# ... and a lot more zeros below
And so it is. Well, that means this is easier than it seemed. All I need to do is read each file, find the place where the zeros end, and remove the corresponding fragment.
How WAV files are read
First, I needed to get more familiar with the WAV format to understand how to work with such files and manipulate the data in them. I collected several sources; one of the most useful turned out to be an old page from stanford.edu (the site is no longer available but, fortunately, it survives on the Wayback Machine). It had a very clear diagram:
So, the structure of a WAV file seems quite simple: first a 44-byte header, then the actual data. With this information I could already start writing code. All I had to do was skip the first 44 bytes, remove the run of zeros at the beginning of the data section, and write out everything else unchanged. I can't help adding, though, that in another source I came across the following warning:
“Some programs assume (and this is very naive on their part) that the preamble in the header is always exactly 44 bytes (as stated in the table above) and that the rest of the file is audio data only. It is not safe to make such assumptions.”
Well, I decided that this was fine: I was writing the program in C, so there was no point in worrying too much about safety.
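For comparison, here is a rough sketch of what a less naive reader could look like. It is not part of the original program: `find_data_chunk` and `read_u32le` are names made up for this illustration. The sketch relies only on the general RIFF layout (a 12-byte `RIFF`/`WAVE` preamble followed by chunks, each with a 4-byte ID and a 4-byte little-endian size) and looks for the `data` chunk instead of assuming it starts at byte 44.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read a 32-bit little-endian unsigned integer from 4 bytes. */
static uint32_t read_u32le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * find_data_chunk (hypothetical helper): walks the RIFF chunks of an
 * open WAV file. On success it returns the file offset of the first
 * audio byte and stores the chunk's byte count in *size. It returns -1
 * if the file is not RIFF/WAVE or has no "data" chunk.
 */
long find_data_chunk(FILE *f, uint32_t *size)
{
    unsigned char hdr[12];
    unsigned char chunk[8];

    if (fseek(f, 0, SEEK_SET) != 0)
        return -1;
    if (fread(hdr, 1, sizeof hdr, f) != sizeof hdr)
        return -1;
    if (memcmp(hdr, "RIFF", 4) != 0 || memcmp(hdr + 8, "WAVE", 4) != 0)
        return -1;

    /* Each chunk: 4-byte ID, 4-byte little-endian size, then the body. */
    while (fread(chunk, 1, sizeof chunk, f) == sizeof chunk) {
        uint32_t len = read_u32le(chunk + 4);

        if (memcmp(chunk, "data", 4) == 0) {
            *size = len;
            return ftell(f);
        }
        /* Skip the body; odd-sized bodies are followed by one pad byte. */
        if (fseek(f, (long)(len + (len & 1)), SEEK_CUR) != 0)
            return -1;
    }
    return -1;
}
```

For the header shown in the hex dump earlier, this should report offset 44 and a data size of 964928 bytes (0x000eb940), so the 44-byte assumption happens to hold for these particular files. A file with an extra chunk between `fmt ` and `data` would give a different offset.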
The code
The code was uncomplicated, less than a hundred lines. In essence, it went through the entire file byte by byte, skipping the first forty-four, and counted consecutive zeros. As soon as it came across a non-zero byte, the program stopped, saved the corresponding index, and started reading the file again from the beginning. This time it skipped everything that preceded the index (not counting the header) and wrote all the other bytes to standard output.
There is no need to cite the entire code, but here is the part that will interest us:
// index was calculated above to be the index of
// the last consecutive zero byte
FILE *f = fopen(argv[1], "rb");
int ind = 0;
int current_byte;
while ((current_byte = fgetc(f)) != EOF) {
    if (ind < 44 || ind >= index) {
        fputc(current_byte, stdout);
    }
    ind += 1;
}
fclose(f);
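The first pass, which computes `index`, is not shown in the excerpt. Going by the description above, it might have looked roughly like the following sketch. This is a reconstruction under assumptions, not the original code: the function name `last_leading_zero` and the `HEADER_SIZE` constant are invented here, and the file is assumed to be at least as long as the header.

```c
#include <stdio.h>

#define HEADER_SIZE 44  /* the fixed-size header assumption from the text */

/*
 * last_leading_zero (hypothetical name): reads f from its current
 * position, which is assumed to be the start of the file, and returns
 * the file index of the last byte in the run of zero bytes that begins
 * right after the header. If the data section does not start with a
 * zero byte, the result is HEADER_SIZE - 1.
 */
long last_leading_zero(FILE *f)
{
    long pos = 0;
    int c;

    while ((c = fgetc(f)) != EOF) {
        if (pos >= HEADER_SIZE && c != 0)
            break;      /* first non-zero data byte: the run is over */
        pos += 1;
    }
    return pos - 1;     /* index of the byte just before the stop point */
}
```

With a 44-byte header followed by six zero bytes, the zeros occupy indices 44 to 49, so the function returns 49, which matches the comment in the excerpt: `index` points at the last zero byte, not at the first byte of real audio.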
All nice and simple. Time to test. I ran the program on one of the files with a particularly long pause.
./strip_audio testfile1.wav > testfile1.nosilence.wav
I checked what xxd printed for testfile1.nosilence.wav. Great, no leading zeros, so it worked. Just to be sure, I quickly opened the file in my audio player.
Immediately, the loudest static I have ever heard in my life hit my ears. I almost fell out of my chair as I desperately tried to pull off my headphones. I remember it was the middle of the night, and the dog came running to check what was wrong with me.
Where did I go wrong?
My ears were still ringing, and I sat and tried to comprehend my rash decisions.
- Mistake #1: I should have turned the volume down.
- Mistake #2: I shouldn't have been wearing headphones.
- Mistake #3: an off-by-one error.
Did you notice the third mistake in the code I gave above? Hint: look at the comment. I computed the variable index as the index of the last zero byte. The condition ind >= index therefore outputs, besides the 44 header bytes, every byte from index onward, including the byte at index itself. Since index points at the last zero in the run, one extra zero byte ends up in the data section.
This can be fixed as follows:
// replaced >= with just >
if (ind < 44 || ind > index) {
    fputc(current_byte, stdout);
}
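To make the difference concrete, here is a toy version of the copy loop that works on an in-memory buffer instead of a file. It is only an illustration: the function `copy_without_silence` and its `keep_index_byte` flag are made up for this sketch and do not appear in the original program.

```c
#include <stddef.h>

#define HEADER_SIZE 44

/*
 * copy_without_silence (hypothetical): copies src to dst, keeping the
 * header plus every byte that passes the index comparison, and returns
 * the number of bytes written. dst must have room for n bytes.
 * keep_index_byte != 0 reproduces the buggy `>=` comparison,
 * keep_index_byte == 0 the fixed `>` comparison.
 */
size_t copy_without_silence(const unsigned char *src, size_t n,
                            size_t index, int keep_index_byte,
                            unsigned char *dst)
{
    size_t out = 0;

    for (size_t ind = 0; ind < n; ind++) {
        int past_silence = keep_index_byte ? (ind >= index) : (ind > index);

        if (ind < HEADER_SIZE || past_silence)
            dst[out++] = src[ind];
    }
    return out;
}
```

Take a 54-byte input: 44 header bytes, six zero bytes (indices 44 to 49, so index is 49), and four bytes of audio. The `>=` version writes 49 bytes, leaving a 5-byte data section that starts with a stray zero. The `>` version writes 48 bytes, leaving exactly the four audio bytes. An odd-length data section in a 16-bit file is already a sign that something is off.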
Now there are no extra zeros in the output, and if you play the file, nothing bad happens. I fixed everything... But wait.
WAV files contain PCM audio, and zeros in this kind of audio data correspond to complete silence. So shouldn't this extra byte have been completely silent? Why was the result so loud and so full of static?
First, let's use Audacity to compare a normal audio file with the monster I created:
Guess which one is the monster? Yes, it's the one whose amplitude sits almost at the maximum the whole time. Why is that?
How audio samples are read
I went back to the sources I had collected and tried to figure out how an off-by-one error could lead to such an explosion in amplitude. I knew that the samples in my files were 16 bits wide and that there were two channels (stereo), so I started looking for the relevant information. Here's what it said in the section on 16-bit stereo PCM audio:
“Each sample is contained in an integer i, which represents the minimum sufficient number of bytes to store a given sample size. The least significant byte is placed first in the store.”
“The minimum sufficient number of bytes to store a given sample size”: the wording here is unnecessarily confusing. The size of i follows from the number of bits in a sample, and in our case there are sixteen. A 16-bit value is, of course, stored in two bytes. And then comes the important point: the least significant of those bytes is stored first. There it is.
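In code, the rule from the quote (low byte first) could be sketched as follows. The helper name `decode_s16le` is invented for this illustration, and 16-bit samples are assumed to be signed.

```c
#include <stdint.h>

/*
 * decode_s16le (hypothetical helper): decodes one signed 16-bit
 * little-endian sample. p[0] is the least significant byte and
 * p[1] the most significant one.
 */
int16_t decode_s16le(const unsigned char *p)
{
    unsigned u = (unsigned)p[0] | ((unsigned)p[1] << 8);

    /* Values with the top bit set represent negative amplitudes. */
    if (u & 0x8000u)
        return (int16_t)((long)u - 65536L);
    return (int16_t)u;
}
```

The order matters a great deal: the bytes `05 00` decode to 5, a barely audible amplitude, while the same two bytes in the opposite order, `00 05`, decode to 1280.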
Take a look at the diagram I made to show what caused such a strong signal:
The top part shows my monster file, in which I accidentally left an extra zero byte. Each of the three samples (s1, s2 and s3) consists of two bytes, the second of which is the more significant. Because of the shift, the byte that should have been the low byte of each sample lands in the high position, so converting these byte pairs to decimal gives a very high amplitude.
At the same time, at the bottom, you can see that if you remove the zero byte, the samples are read as they should, and the values in the audio file are within reasonable limits.
It turns out that with 8-bit audio the stray extra byte would not have caused any problems. But the audio was 16-bit, so I shifted the entire sequence of samples by one byte, and each least significant byte was read as the most significant.
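One further observation, which goes beyond the original program: counting zero bytes only lands on a sample boundary if the first non-silent sample has a non-zero low byte. A first sample such as `00 01` (256) begins with a zero byte that would be counted as silence, and the shift would come back. A possible way around this, sketched below under that assumption, is to cut only whole frames. The name `aligned_trim_start` is hypothetical, and `block_align` stands for the frame size in bytes (channels times bytes per sample, so 4 for 16-bit stereo).

```c
#include <stddef.h>

/*
 * aligned_trim_start (hypothetical helper): given the bytes of the data
 * section, returns how many leading bytes can be dropped so that only
 * whole all-zero frames are removed. block_align is the frame size in
 * bytes; a value of 0 is treated as "do not trim".
 */
size_t aligned_trim_start(const unsigned char *data, size_t n,
                          size_t block_align)
{
    size_t zeros = 0;

    if (block_align == 0)
        return 0;

    /* Count the leading zero bytes. */
    while (zeros < n && data[zeros] == 0)
        zeros++;

    /* Round down to a frame boundary so no sample is split. */
    return zeros - (zeros % block_align);
}
```

For data that starts with nine zero bytes followed by `01`, a 16-bit stereo file would be cut after 8 bytes rather than 9, which keeps every sample and both channels aligned. With a frame size of 1 (8-bit mono) the result is simply the number of leading zeros, consistent with the remark above that 8-bit audio would not have had this problem.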
Conclusions
- Check the sound wave of an audio file before playing it at maximum volume