Finding duplicate FLAC albums in your collection


#1

For a variety of reasons I’ve found that over the years as my tagging, filing and backup conventions have changed I’ve sometimes ended up duplicating albums in my collection, principally because of changes to folder names and locations. I had a little time to do some deduping and thought I’d share my methods, 1) so it’s there for me to reference when I need to, and 2) so others could benefit.

What I’m sharing basically culminates in a table listing of folders containing the same number of FLAC files with identical audio content to one or more of the other listed folders. This then leaves you to manually delete the FLAC content in the duplicated album should you wish to (for obvious reasons I’m not automating this part, feel free to do so yourself).

To get the job done I’ve used some simple terminal commands (Linux based, so Windows users will have to find the Windows equivalent) coupled with a little SQLite scripting.

Here goes…

  1. Firstly, from the root folder of your music collection generate a text file produced from listing files, their path and FLAC md5sum using bash terminal:

find -type f -name *.flac -print0 | xargs -0 -n1 metaflac --with-filename --show-md5sum > md5sums.txt

  1. Add the necessary delimiters to be able to import the file into a SQLite table

echo __path:__md5sig:__filename:__dirpath > import.csv && sed ‘:a;N;$!ba;s/\n/::\n/g’ md5sums.txt >> import.csv

  1. Now create a database in SQLite, add a table “audio” having text fields __path, __md5sig, __dirpath and __filename and import the csv into the audio table

  2. Run the necessary SQL scripts:

– derive filename from the full file path
UPDATE audio
SET __filename = [replace](__path, rtrim(__path, [replace](__path, ‘/’, ‘’) ), ‘’);

– derive __dirpath from the full file path
UPDATE audio
SET __dirpath = substr(__path, 1, length(__path) - length([replace](__path, rtrim(__path, [replace](__path, ‘/’, ‘’) ), ‘’) ) );

– create table in which to store concatenated __md5sig for all __dirnames in duplicates table

DROP TABLE IF EXISTS __dirpath_content_concat__md5sig;
CREATE TABLE __dirpath_content_concat__md5sig (
__dirpath TEXT,
concat__md5sig TEXT
);

– populate table with __dirpath and concatenated __md5sig of all files associated with __dirpath (note order by __md5sig to ensure concatenated __md5sig is consistently generated irrespective of physical record sequence).
INSERT INTO __dirpath_content_concat__md5sig (
__dirpath,
concat__md5sig
)
SELECT __dirpath,
group_concat(__md5sig, ‘|’)
FROM audio
GROUP BY __dirpath
ORDER BY __md5sig;

–now write the duplicate records into a separate table listing all __dirname’s that have identical FLAC contents
DROP TABLE IF EXISTS __dirpaths_with_same_content;
CREATE TABLE __dirpaths_with_same_content (
__dirpath TEXT,
concat__md5sig TEXT
);

INSERT INTO __dirpaths_with_same_content (
__dirpath,
concat__md5sig
)
SELECT __dirpath,
concat__md5sig
FROM __dirpath_content_concat__md5sig
WHERE concat__md5sig IN (
SELECT concat__md5sig
FROM __dirpath_content_concat__md5sig
GROUP BY concat__md5sig
HAVING count( * ) > 1
)
ORDER BY concat__md5sig, __dirpath;

– populate table listing directories in which FLAC files should be deleted as they’re duplicates
DROP TABLE IF EXISTS __dirpaths_with_FLACs_to_kill;
CREATE TABLE __dirpaths_with_FLACs_to_kill (
__dirpath TEXT,
concat__md5sig TEXT
);

INSERT INTO __dirpaths_with_FLACs_to_kill (
__dirpath,
concat__md5sig
)
SELECT __dirpath,
concat__md5sig
FROM __dirpaths_with_same_content
WHERE rowid NOT IN (
SELECT min(rowid)
FROM __dirpaths_with_same_content
GROUP BY concat__md5sig
);

The table __dirpaths_with_same_content will list all folders in your collection whose FLAC audio content is identical to one or more folders in your collection. The file is sorted so that folders with identical FLAC audio content are grouped together so as to ease navigating your file system.

The table __dirpaths_with_FLACs_to_kill represents a subset of __dirpaths_with_same_content. Deleting the FLAC content in the corresponding folders will eliminate the duplication. It goes without saying in choosing duplicates to add to the killfile no regard has been had to tagging etc…use contents of __dirpaths_with_FLACs_to_kill at your own risk.

IMPORTANT NOTE: Please do not post questions / concerns regarding FLAC files having different filenames, being housed in different folder names or having different file sizes and thus having incorrect results. FLAC files produced by a FLAC specification compliant FLAC encoder have an internal md5sum of the audio stream - this md5sum remains the same regardless of the level of compression applied and is independent of any metadata present in the FLAC header.


(Ralph Pantuso) #2

I believe that the Windows equivalent is WTF?? :smile:

All kidding aside that is very impressive and I do wish I had the skill set to generate a Windows or DOS equivalent.


#3

The only things you need to figure out are:

  1. how to have Windows walk a directory tree and run metaflac in each folder appending the output into a text file.

  2. How to append :: to the end of every line in the file generated in 1. above

The rest is SQLite which is available on most platforms incl. Windows.


(Ralph Pantuso) #4

You may as well have answered me in Latin for all that I understood. :grinning:

Thanks!


(GaryM) #5

Cool. But for noobs like me I just spent a few bucks on PerfectTunes with it’s deduplication function. This is a program by the maker of dbpoweramp. Does other useful things too.


#6

Yip, I’m aware of it, never tried it though and I wanted something that did only what I want, so my code ignores instances of a track that may be duplicated where the entire album has not been duplicated, so I’ve no risk of deleting tracks from Soundtracks or Compilations.


(GaryM) #7

Good point.