bash - compare big xml files - get differences
The following script provides a solution to compare two big xml files. I tried to compare a lot of xml files with a size of greater 500 megabytes with different tools. Each tool was eating up my memory and swap and finally crashed. All i want to have is "show me what is in file one and not in file two and vice versa". I've reached this goal by using a property my xml files have. Each file as nodes. Each node has a unique identifier inside. I cutting out the unique identifier tag and putting this tag, line by line, into a file. After that, i'm sorting this unique identifiers. Finally i am using diff. To create a more useful output, i'm separating the "what is only in file one" into a own file (and the same for file two).
Happy using and if you find errors, i'm ready to fix them :-).
#!/bin/bash #### # script to compare two xml files by (unique) tag #### # @author stev leibelt # @since 2013-03-13 #### if [[ $# -eq 3 ]]; then XML_FILE_ONE="$1" XML_FILE_TWO="$2" XML_TAG="$3" if [[ -f "$XML_FILE_ONE" && -f "$XML_FILE_TWO" && ! -z "$XML_TAG" ]]; then #retrieving xml_tags per file #reduce xmls by lines containing the tag sed -n -e 's/.*<'$XML_TAG'>\(.*\)<\/'$XML_TAG'>.*/\1/p' $XML_FILE_ONE > $XML_FILE_ONE'.sed' sed -n -e 's/.*<'$XML_TAG'>\(.*\)<\/'$XML_TAG'>.*/\1/p' $XML_FILE_TWO > $XML_FILE_TWO'.sed' #sort and uniq the sed'ed files sort $XML_FILE_ONE'.sed' | uniq > $XML_FILE_ONE'.sort' sort $XML_FILE_TWO'.sed' | uniq > $XML_FILE_TWO'.sort' #output the differences diff $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.diff' #diff --side-by-side $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.diff' #comm -3 $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' > 'xml_diff_by_tag.comm' #show only differences per file sed -n -e 's/^<\ \(.*\)/\1/p' 'xml_diff_by_tag.diff' > $XML_FILE_ONE'.diff.uniq' sed -n -e 's/^>\ \(.*\)/\1/p' 'xml_diff_by_tag.diff' > $XML_FILE_TWO'.diff.uniq' #sed -n -e 's/^<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_ONE'.comm.uniq' #sed -n -e 's/\t<\(.*\)/<\1/p' 'xml_diff_by_tag.comm' > $XML_FILE_TWO'.comm.uniq' #removing unused files rm -fr $XML_FILE_ONE'.sed' $XML_FILE_TWO'.sed' $XML_FILE_ONE'.sort' $XML_FILE_TWO'.sort' else echo 'Invalid arguments provided' echo 'try '$0' $xmlFileOne $xmlFileTwo $comparingTag' fi else echo 'Invalid number of arguments provided' echo 'try '$0' $xmlFileOne $xmlFileTwo $comparingTag' fi
Available on github.com.