Skip to content

A way to deal with Schei* encoding - deal with "Non-ISO extended-ASCII"

We had, again, some issues with encoding.
*file* returns an output like "Non-ISO extended-ASCII". This time, I created a basic step sequence here.
At the end, it really is an brute force approach. And we are using heavily a lot of open source software (thanks again duds!). Furthermore, the sequence steps are based on this post from superuser.com.

# create a list with supported encodings
iconv --list | sed 's/\/\/$//' | sort > list_with_supported_encodings.txt
# iterate over the list of know encodings and try to encode the file with it
LOCAL_SUPPORTED_ENCODING_FILE_PATH='list_with_supported_encodings.txt'
LOCAL_RESULT_FILE_PATH='result.txt'

for LOCAL_ENCODING in `cat $LOCAL_SUPPORTED_ENCODING_FILE_PATH`; do
    printf "$LOCAL_ENCODING  "
    iconv -f $LOCAL_ENCODING -t UTF-8 2016-02-08_UPLOAD_CSV.csv.stev > /dev/null 2>&1 && echo "ok: $LOCAL_ENCODING" || echo "fail: $LOCAL_ENCODING"
# uncomment line below if you want to see the result and put it into the file
#done | tee $LOCAL_RESULT_FILE_PATH
# put the output into the file
done | cat > $LOCAL_RESULT_FILE_PATH
# filter only the successful tryouts
LOCAL_RESULT_FILE_PATH='result.txt'

cat $LOCAL_RESULT_FILE_PATH | grep 'ok:' > 'only_ok_'$LOCAL_RESULT_FILE_PATH
Now comes the hard work, you have to give it a try for each "ok" result in the fitting file.
# read the result file with the ok content and create a encoded version of your broken file
LOCAL_BROKEN_FILE_PATH='relative/or/full/qualified/file/name.txt'
LOCAL_RESULT_FILE_PATH='only_ok_result.txt'

# sed -e 's/^\(.*\)\ \ ok\(.*\)/\1/p' means
# remove any kind of content starting with '  ok:' on each line
# assmed a line looks like "S2  ok: WS2", the result will look like "WS2"

for LOCAL_ENCODING in `cat $LOCAL_RESULT_FILE_PATH | grep ok | sed -e 's/^\(.*\)\ \ ok\(.*\)/\1/p' | uniq; do
    LOCAL_CONVERTED_FILE_PATH=$LOCAL_ENCODING'_'$LOCAL_BROKEN_FILE_PATH
    #echo $LOCAL_CONVERTED_FILE_PATH
    iconv -f CP850 -t UTF-8 $LOCAL_BROKEN_FILE_PATH > $LOCAL_CONVERTED_FILE_PATH
done
Open each file and check if your fitting special characters are looking good. "WINDOWS-1258" and "CP850" are good blind guesses here.

Trackbacks

No Trackbacks

Comments

Display comments as Linear | Threaded

No comments

The author does not allow comments to this entry

Add Comment

E-Mail addresses will not be displayed and will only be used for E-Mail notifications.
To leave a comment you must approve it via e-mail, which will be sent to your address after submission.

To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly.
CAPTCHA

Markdown format allowed
Form options