
A way to deal with Schei* encoding - deal with "Non-ISO extended-ASCII"

We had, again, some issues with encoding.
*file* returned output like "Non-ISO extended-ASCII". This time, I wrote down a basic sequence of steps here.
In the end, it really is a brute-force approach, and it leans heavily on open source software (thanks again, dudes!). The steps are based on this post from superuser.com.

# create a list with supported encodings
iconv --list | sed 's/\/\/$//' | sort > list_with_supported_encodings.txt
# iterate over the list of known encodings and try to decode the file with each of them
LOCAL_SUPPORTED_ENCODING_FILE_PATH='list_with_supported_encodings.txt'
LOCAL_RESULT_FILE_PATH='result.txt'

for LOCAL_ENCODING in `cat $LOCAL_SUPPORTED_ENCODING_FILE_PATH`; do
    # pass the name as an argument so it is never interpreted as a printf format
    printf '%s  ' "$LOCAL_ENCODING"
    iconv -f "$LOCAL_ENCODING" -t UTF-8 2016-02-08_UPLOAD_CSV.csv.stev > /dev/null 2>&1 && echo "ok: $LOCAL_ENCODING" || echo "fail: $LOCAL_ENCODING"
# uncomment line below if you want to see the result and put it into the file
#done | tee $LOCAL_RESULT_FILE_PATH
# put the output into the file
done > "$LOCAL_RESULT_FILE_PATH"
# filter only the successful tryouts
LOCAL_RESULT_FILE_PATH='result.txt'

grep 'ok:' "$LOCAL_RESULT_FILE_PATH" > "only_ok_$LOCAL_RESULT_FILE_PATH"
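The probing and filtering steps above can also be folded into one small function. What follows is only a sketch under a few assumptions: probe_encodings is a made-up helper name, the candidate encodings are passed as arguments, and without candidates it falls back to everything that iconv --list reports.

```shell
#!/bin/sh
# Sketch: print every candidate encoding that iconv can decode the file from
# without an error. probe_encodings is a hypothetical helper, not part of iconv.
# Usage: probe_encodings FILE [ENCODING...]
probe_encodings() {
    local file_path="$1" encoding
    shift
    if [ "$#" -eq 0 ]; then
        # No candidates given: try everything iconv knows about.
        # tr splits lines that carry several aliases, sed strips the trailing "//".
        set -- $(iconv --list | tr ' ,' '\n\n' | sed -e 's/\/\/$//' | sort -u)
    fi
    for encoding in "$@"; do
        if iconv -f "$encoding" -t UTF-8 "$file_path" > /dev/null 2>&1; then
            printf '%s\n' "$encoding"
        fi
    done
}
```

Calling it as probe_encodings 2016-02-08_UPLOAD_CSV.csv.stev > only_ok_result.txt would write just the encoding names, one per line. Keep in mind that "ok" only means iconv did not complain. It does not mean the result is readable.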
Now comes the hard work: you have to try every "ok" result against your broken file.
# read the result file with the ok content and create a converted version of your broken file for each encoding
LOCAL_BROKEN_FILE_PATH='relative/or/full/qualified/file/name.txt'
LOCAL_RESULT_FILE_PATH='only_ok_result.txt'

# sed -n -e 's/^\(.*\)  ok.*/\1/p' means
# drop everything starting with '  ok' on each line and print only the lines that matched
# assuming a line looks like "WS2  ok: WS2", the result will look like "WS2"

for LOCAL_ENCODING in `grep 'ok:' "$LOCAL_RESULT_FILE_PATH" | sed -n -e 's/^\(.*\)  ok.*/\1/p' | uniq`; do
    # prefix the file name (not the whole path) with the encoding
    LOCAL_CONVERTED_FILE_PATH=`dirname "$LOCAL_BROKEN_FILE_PATH"`/$LOCAL_ENCODING'_'`basename "$LOCAL_BROKEN_FILE_PATH"`
    #echo $LOCAL_CONVERTED_FILE_PATH
    # convert with the encoding under test, not with a fixed one
    iconv -f "$LOCAL_ENCODING" -t UTF-8 "$LOCAL_BROKEN_FILE_PATH" > "$LOCAL_CONVERTED_FILE_PATH"
done
Open each converted file and check whether your special characters look right. "WINDOWS-1258" and "CP850" are good blind guesses here.
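If you already know a word that must appear in the file (a name with an umlaut, for example), the manual check can be narrowed down a bit. The following is a sketch only: match_encodings is a made-up helper name, and the expected text has to be given in UTF-8.

```shell
#!/bin/sh
# Sketch: keep only the candidate encodings whose UTF-8 conversion contains
# an expected piece of text. match_encodings is a hypothetical helper.
# Usage: match_encodings FILE EXPECTED_TEXT ENCODING...
match_encodings() {
    local file_path="$1" expected="$2" encoding
    shift 2
    for encoding in "$@"; do
        # grep -F treats the expected text as a fixed string, not a pattern
        if iconv -f "$encoding" -t UTF-8 "$file_path" 2>/dev/null | grep -qF -- "$expected"; then
            printf '%s\n' "$encoding"
        fi
    done
}
```

Something like match_encodings name.txt 'Müller' CP850 WINDOWS-1258 ISO-8859-1 then prints only the encodings that produce the expected word. Several encodings can still match, so a final look at the converted files remains a good idea.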

web - The great thing about URL encodings is that there are so many to choose from

harlows_monkeys (9 points, 2 years ago):

I'm amazed at how poorly written some of the fundamental web specs are. Here's the heart of the spec for how to encode application/x-www-form-urlencoded data:

    Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in RFC1738, section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').

First problem: it says to do space to '+', then do the %HH encoding. So given the input "a b+c", the first step would give "a+b+c" as the input to the second step.

Second problem: '+' is not a reserved character in RFC1738, so after %HH encoding the reserved characters in "a+b+c", we have "a+b+c". Unfortunately, we'd get the same result if the initial input had been "a+b c". Collision!

The second problem is resolved by noting, upon a second reading, that it says reserved characters are escaped as described in RFC1738. That could be interpreted to mean that the escaping is as described in the RFC--not that we are supposed to be using the same set of reserved characters as the RFC uses. The reserved characters for the form encoding are the non-alphanumeric characters. That's a little better--it will give us "a%2Bb%2Bc". Still a collision though.

To fix that, we've got to read "Space characters are replaced by '+', and then reserved characters are escaped..." as meaning space characters are replaced by '+', and then any reserved characters OTHER THAN the '+'s that resulted from the space replacement are escaped.

[deleted] (1 point, 2 years ago):

HTML FTW!

SoPoOneO (1 point, 2 years ago):

Why are spaces in the URL treated differently than spaces in the query string? According to this article spaces become &20 in the first case and + in the second.

KayEss (3 points, 2 years ago):

%20 rather than &20, but to answer the actual question, I think the best way of looking at it is that the space -> + transform is a bug. Back in the day when this was being put together nobody was using file specifications with spaces in them -- after all, half of the software you'd want to use did all sorts of horrid random things if you had spaces in file names (this is mostly sorted out these days, but there is still some weirdness in dark corners). So although spaces in filenames were rare, spaces in GET submissions were not, and it was just felt that %20 was too ugly for something so common, so the + was picked as an easier to read alternative. The confusion that this might cause wasn't anticipated though, or if it was then it wasn't given enough weight.

These days you can safely use spaces pretty much everywhere, but you cannot use + signs in URLs because too many systems are broken in their handling of them. Even if you correctly encode them as %2B, some systems will go so far as to decode that to a +, then replace that with a space, which they will re-encode as %20 before requesting the URL back from your server. Ouch!

Porges (1 point, 2 years ago):

Because the query string is x-www-form-urlencoded and then stuck into the URL. This is what the article is about...
source Nothing more to say ;-).
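The reading that harlows_monkeys arrives at (spaces become '+', every other non-alphanumeric character, including a literal '+', becomes %HH) can be sketched in a few lines of shell. This is an illustration of that interpretation only, not a complete implementation of the spec: form_urlencode is a made-up name, line breaks are not normalized to CR LF, and input is handled byte by byte.

```shell
#!/bin/sh
# Sketch: every byte is classified exactly once, so a literal '+' can never be
# confused with a '+' that stands for a space.
# form_urlencode is a hypothetical helper name.
form_urlencode() {
    local hex
    # od prints every byte of the input as two lowercase hex digits
    for hex in $(printf '%s' "$1" | od -An -v -tx1); do
        case "$hex" in
            20)
                # a space becomes '+'
                printf '+' ;;
            3[0-9]|4[1-9a-f]|5[0-9a]|6[1-9a-f]|7[0-9a])
                # 0-9, A-Z, a-z pass through: hex -> octal escape -> character
                printf "\\$(printf '%03o' "0x$hex")" ;;
            *)
                # everything else, including a literal '+', becomes %HH
                printf '%%%s' "$(printf '%s' "$hex" | tr 'a-f' 'A-F')" ;;
        esac
    done
    printf '\n'
}
```

With this, form_urlencode 'a b+c' prints a+b%2Bc and form_urlencode 'a+b c' prints a%2Bb+c, so the two inputs from the comment no longer collide.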