bash - compare big xml files - get differences
The following script compares two big XML files. I tried to compare a lot of XML files larger than 500 megabytes with different tools. Each tool ate up my memory and swap and finally crashed. All I want is "show me what is in file one and not in file two, and vice versa". I reached this goal by using a property my XML files have: each file has nodes, and each node contains a unique identifier. I cut out the value of the unique identifier tag and write it, line by line, into a file. After that, I sort these unique identifiers. Finally, I use diff. To create more useful output, I put the "what is only in file one" entries into their own file (and do the same for file two).
Happy using, and if you find errors, I'm ready to fix them :-).
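Before the full script, a minimal sketch of the idea on two tiny sample files may help. The file names, the `<id>` tag and the values are made up for illustration, and the helper function `extract_ids` is hypothetical; it is not part of the script.

```bash
#!/bin/bash
# Hypothetical sample data: two small XML files whose nodes carry an <id> tag.
workdir="$(mktemp -d)"
printf '%s\n' '<node><id>1</id></node>' '<node><id>2</id></node>' '<node><id>3</id></node>' > "$workdir/one.xml"
printf '%s\n' '<node><id>2</id></node>' '<node><id>3</id></node>' '<node><id>4</id></node>' > "$workdir/two.xml"

# Steps 1 and 2: keep only the value between <id> and </id> (one per line),
# then sort and de-duplicate the values.
extract_ids() {
    sed -n -e 's/.*<id>\(.*\)<\/id>.*/\1/p' "$1" | sort -u
}
extract_ids "$workdir/one.xml" > "$workdir/one.sort"
extract_ids "$workdir/two.xml" > "$workdir/two.sort"

# Step 3: diff the sorted lists. diff exits with status 1 when the files
# differ, so its exit status is deliberately ignored here.
diff "$workdir/one.sort" "$workdir/two.sort" > "$workdir/ids.diff" || true

# Step 4: lines starting with "< " exist only in file one, "> " only in file two.
only_in_one="$(sed -n -e 's/^< //p' "$workdir/ids.diff")"
only_in_two="$(sed -n -e 's/^> //p' "$workdir/ids.diff")"

echo "only in one: $only_in_one"
echo "only in two: $only_in_two"

rm -rf "$workdir"
```

With this sample data the sketch reports 1 as only in file one and 4 as only in file two.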
#!/bin/bash
####
# script to compare two xml files by (unique) tag
####
# @author stev leibelt
# @since 2013-03-13
####
if [[ $# -eq 3 ]]; then
    XML_FILE_ONE="$1"
    XML_FILE_TWO="$2"
    XML_TAG="$3"
    if [[ -f "$XML_FILE_ONE" && -f "$XML_FILE_TWO" && -n "$XML_TAG" ]]; then
        #retrieving xml_tags per file
        #reduce xmls to the values of the lines containing the tag
        #(opening and closing tag have to be on the same line)
        sed -n -e "s/.*<${XML_TAG}>\(.*\)<\/${XML_TAG}>.*/\1/p" "$XML_FILE_ONE" > "${XML_FILE_ONE}.sed"
        sed -n -e "s/.*<${XML_TAG}>\(.*\)<\/${XML_TAG}>.*/\1/p" "$XML_FILE_TWO" > "${XML_FILE_TWO}.sed"
        #sort and uniq the sed'ed files
        sort -u "${XML_FILE_ONE}.sed" > "${XML_FILE_ONE}.sort"
        sort -u "${XML_FILE_TWO}.sed" > "${XML_FILE_TWO}.sort"
        #output the differences
        diff "${XML_FILE_ONE}.sort" "${XML_FILE_TWO}.sort" > 'xml_diff_by_tag.diff'
        #diff --side-by-side "${XML_FILE_ONE}.sort" "${XML_FILE_TWO}.sort" > 'xml_diff_by_tag.diff'
        #comm -3 "${XML_FILE_ONE}.sort" "${XML_FILE_TWO}.sort" > 'xml_diff_by_tag.comm'
        #show only differences per file
        sed -n -e 's/^< \(.*\)/\1/p' 'xml_diff_by_tag.diff' > "${XML_FILE_ONE}.diff.uniq"
        sed -n -e 's/^> \(.*\)/\1/p' 'xml_diff_by_tag.diff' > "${XML_FILE_TWO}.diff.uniq"
        #sed -n -e 's/^<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > "${XML_FILE_ONE}.comm.uniq"
        #sed -n -e 's/\t<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > "${XML_FILE_TWO}.comm.uniq"
        #removing unused files
        rm -f "${XML_FILE_ONE}.sed" "${XML_FILE_TWO}.sed" "${XML_FILE_ONE}.sort" "${XML_FILE_TWO}.sort"
    else
        #report errors on stderr and signal failure to the caller
        echo 'Invalid arguments provided' >&2
        echo "try $0 \$xmlFileOne \$xmlFileTwo \$comparingTag" >&2
        exit 1
    fi
else
    echo 'Invalid number of arguments provided' >&2
    echo "try $0 \$xmlFileOne \$xmlFileTwo \$comparingTag" >&2
    exit 1
fi
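The commented-out comm lines in the script hint at an alternative to diff. As a rough sketch (not what the script above does), `comm -23` and `comm -13` print the entries unique to each file directly, so the second sed pass would not be needed. The identifier lists and file names below are made up for illustration, and comm expects both inputs to be sorted in the same collation order.

```bash
#!/bin/bash
# Hypothetical sample input: already sorted, de-duplicated identifier lists,
# as the ".sort" files of the script would contain them.
workdir="$(mktemp -d)"
printf '%s\n' 'a1' 'b2' 'c3' > "$workdir/one.sort"
printf '%s\n' 'b2' 'c3' 'd4' > "$workdir/two.sort"

# -23 suppresses columns 2 and 3, leaving the lines found only in the first file.
only_in_one="$(comm -23 "$workdir/one.sort" "$workdir/two.sort")"

# -13 suppresses columns 1 and 3, leaving the lines found only in the second file.
only_in_two="$(comm -13 "$workdir/one.sort" "$workdir/two.sort")"

echo "only in one: $only_in_one"
echo "only in two: $only_in_two"

rm -rf "$workdir"
```

For the sample lists this reports a1 as only in file one and d4 as only in file two.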
Available on github.com.