Typo
====
Typos may be hard to find in all source files (code, documentation, tools).
:command:`hunspell` helps to detect those mistakes.
:command:`pylint` can also be used for Python sources.
Both tools rely on dictionaries providing lists of natural-language words.

Here is the flow:

.. blockdiag:: typo_blockdiag.dot
Missing words in dictionaries by categories
-------------------------------------------
System dictionaries lack some scientific words (e.g. barocline), computer language reserved words, acronyms (e.g. IGCMG) and code variable names.
Those missing words can be listed in dedicated files.
In the directory :file:`docs/manual/for_typo/`, each :file:`*.txt` file contains the words of one specific category.
For example, missing scientific words can be added to :file:`jargon.txt`.
There is also a file :file:`type.aff` which will be used by :command:`hunspell`.
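For reference, an affix file for a plain supplemental word list can be almost empty. The sketch below shows what such a minimal file may contain; this is an assumption about the hunspell affix format in general, not the content of the project's actual file:

.. code-block:: none

   # Minimal hunspell affix file: only declares the dictionary encoding;
   # no affix rules are needed for a plain word list.
   SET UTF-8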
Build one supplemental dictionary
---------------------------------
Supplemental dictionaries can be merged into a single one and added to the list of dictionaries used by :command:`hunspell` during a spelling check.
They must be encoded in UTF-8.

.. code-block:: sh

   listf=$(find ${PROJECT}/docs/manual/for_typo -name "*.txt")
   nontypo=${PROJECT_LOG}/nontypo
   nontypo_uniq=${PROJECT_LOG}/nontypo_uniq
   rm -f ${nontypo} ${nontypo_uniq}
   for onefile in ${listf}
   do
       cat ${onefile} >> ${nontypo}
   done
   sort -u ${nontypo} | sort --ignore-case > ${nontypo_uniq}
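The two-step sort above first removes exact duplicates, then orders the result case-insensitively. A minimal sketch of that behaviour on a hypothetical word list (``LC_ALL=C`` is forced here only to make the ordering reproducible):

.. code-block:: sh

   printf 'Beta\nalpha\nbeta\nalpha\n' > words
   # first pass: byte-wise sort with exact duplicates removed;
   # second pass: case-insensitive ordering of the remaining words
   LC_ALL=C sort -u words | LC_ALL=C sort --ignore-case
   # prints alpha, Beta, beta (one word per line)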
The list :file:`${nontypo_uniq}` can also be used to check for typos in documentation files and source code.
First, turn the word list into a :file:`.dic` file usable by :command:`hunspell` (i.e. add the number of words on the first line):

.. code-block:: sh

   nontypo_uniq_dic=${PROJECT_LOG}/nontypo_uniq.dic
   linecount=$(wc -l < ${nontypo_uniq})
   sed "1i ${linecount}" ${nontypo_uniq} > ${nontypo_uniq_dic}
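As a concrete sketch of the resulting :file:`.dic` layout, using a hypothetical three-word list: the first line holds the word count, each following line holds one word.

.. code-block:: sh

   printf 'barocline\nIGCMG\nlibIGCM\n' > mini_list
   # prepend the word count, as hunspell .dic files require
   linecount=$(wc -l < mini_list)
   sed "1i ${linecount}" mini_list > mini_list.dic
   cat mini_list.dic
   # prints 3, barocline, IGCMG, libIGCM (one item per line)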
The associated :file:`nontypo_uniq.aff` file already exists in :file:`${PROJECT}/docs/manual/for_typo`:

.. code-block:: sh

   ln -s ${PROJECT}/docs/manual/for_typo/nontypo_uniq.aff ${PROJECT_LOG}/nontypo_uniq.aff

Now we have :file:`${PROJECT_LOG}/nontypo_uniq.dic` and :file:`${PROJECT_LOG}/nontypo_uniq.aff` usable by :command:`hunspell`.
Check typo in files
-------------------

.. todo:: find why the ``-p`` option of :command:`hunspell` does not work; for now, the command has to be executed in :file:`${PROJECT_LOG}`, where the :file:`.dic` and :file:`.aff` files are located.
Check typo in wiki pages
------------------------
Until :ref:`tracwiki_migration` is achieved, we have to check typos in the HTML files produced by Trac on http://forge.ipsl.jussieu.fr/igcmg_doc/wiki/.
By convention, all |igcmg_doc| page names start with Doc, but there are exceptions like Train, WikiStart, etc.
To get the URIs of all hand-written wiki pages of http://forge.ipsl.jussieu.fr/igcmg_doc/ [1]_:

.. code-block:: sh

   excluded_href=DocYgraphvizLibigcmprod
   list_href=${PROJECT_LOG}/list_href
   xsltproc --novalid \
       ${PROJECT}/docs/manual/for_tracwiki/titleindex.xsl \
       http://forge.ipsl.jussieu.fr/igcmg_doc/wiki/TitleIndex | \
       grep "/igcmg_doc/wiki/" | grep -v "?action" | grep -v "?format" | \
       grep -v ${excluded_href} | sort -u | \
       sed -e "s@^@http://forge.ipsl.jussieu.fr@" > ${list_href}
   trac_uri=${PROJECT_LOG}/trac_uri
   sed -e "s@^@http://forge.ipsl.jussieu.fr/igcmg_doc/wiki/@" \
       ${PROJECT}/docs/manual/for_tracwiki/tracpages.txt | sort -u > ${trac_uri}
   list_uri=$(comm -13 ${trac_uri} ${list_href})

.. [1] We exclude DocYgraphvizLibigcmprod because Trac detects an internal error on this page (an issue with the graphviz Trac plugins).
To download those URIs locally:

.. code-block:: sh

   dirhtml=${PROJECT_LOG}/html/
   rm -fr ${dirhtml}
   mkdir ${dirhtml}
   for uri in ${list_uri}
   do
       wget -P ${dirhtml} ${uri}
   done
We can now check typos in the HTML files:

.. code-block:: sh

   cd ${PROJECT_LOG}
   listf=$(find ${dirhtml} -type f)
   hunspell_out=${PROJECT_LOG}/hunspell_out
   hunspell_out_uniq=${PROJECT_LOG}/hunspell_out_uniq
   rm -f ${hunspell_out} ${hunspell_out_uniq}
   for onefile in ${listf}
   do
       LC_ALL=C hunspell -d en_US,nontypo_uniq --check-url -i utf-8 -l < ${onefile} >> ${hunspell_out}
   done
   sort -u ${hunspell_out} | sort --ignore-case > ${hunspell_out_uniq}
.. warning::

   Beware of writing ``LC_ALL=C;hunspell ...`` (with a semicolon): it changes ``LC_ALL`` for the rest of the shell session. The prefix form ``LC_ALL=C hunspell ...`` applies only to that single command.
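The difference between the two forms can be seen with any command; a minimal sketch:

.. code-block:: sh

   unset LC_ALL
   # prefix form: LC_ALL=C applies only to this one command...
   LC_ALL=C sh -c 'echo "inside: ${LC_ALL}"'
   # ...and is not set in the current shell afterwards
   echo "after: ${LC_ALL:-unset}"
   # prints inside: C
   #        after: unset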
:file:`${hunspell_out_uniq}` contains:

- typos to be fixed in wiki pages
- false positives to be added to a file in :file:`docs/manual/for_typo/`
- false positives to be ignored because they are too hard to add (encoding issues for Greek words, etc.)
.. warning::

   The following command prints only the lines present in both :file:`${hunspell_out_uniq}` and :file:`${nontypo_uniq}`:

   .. code-block:: sh

      comm --nocheck-order -12 ${hunspell_out_uniq} ${nontypo_uniq}

   If the output is not empty, the supplemental dictionary has not been used by :command:`hunspell`.
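:command:`comm` compares two sorted files line by line; ``-12`` suppresses the lines unique to each file, leaving only the common ones. A toy sketch with hypothetical lists:

.. code-block:: sh

   printf 'alpha\nbeta\ngamma\n' > list_a
   printf 'beta\ndelta\n' > list_b
   # keep only the lines present in both (sorted) files
   comm -12 list_a list_b
   # prints beta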
.. todo:: give some ideas (LC, ?)
To find one of the wrong spellings in the downloaded HTML pages:

.. code-block:: sh

   w=amonch # take a real one from ${hunspell_out_uniq}
   find ${dirhtml} -type f -exec grep -Hi ${w} {} \;
.. note::

   It is also possible to find wrong spellings via the search facility of the Trac interface, but results may differ (case sensitivity, Trac plugins).
.. warning::

   Corrections have to be done via the wiki interface of the forge.
Check typo in attached PDF on wiki pages
----------------------------------------
To get all the URIs of the attached files:

.. code-block:: sh

   listf=$(find ${dirhtml} -type f)
   list_attached=${PROJECT_LOG}/list_attached
   list=${PROJECT_LOG}/list
   rm -f ${list_attached} ${list}
   for onefile in ${listf}
   do
       xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t \
           -m "//x:a[@title='Download']" \
           -v "concat('http://forge.ipsl.jussieu.fr/',@href)" -n ${onefile} >> ${list}
   done
   sort -u ${list} > ${list_attached}
To isolate the PDF files among these URIs:

.. code-block:: sh

   list_pdf=$(grep "\.pdf$" ${list_attached})
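A sketch of this filter on a hypothetical list of attachment URIs; the anchored pattern keeps only the entries ending in ``.pdf``:

.. code-block:: sh

   printf 'wiki/doc.pdf\nwiki/logo.png\nwiki/slides.pdf\n' > attached
   # "$" anchors the match at the end of the line
   grep "\.pdf$" attached
   # prints wiki/doc.pdf
   #        wiki/slides.pdf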
To download those URIs locally:

.. code-block:: sh

   dirpdf=${PROJECT_LOG}/pdf/
   rm -rf ${dirpdf}
   for uri in ${list_pdf}
   do
       wget -P ${dirpdf} ${uri}
   done
We can now convert these PDF files to text files:

.. code-block:: sh

   list_pdf=$(find ${dirpdf} -type f)
   for pdf in ${list_pdf}
   do
       pdftotext ${pdf} ${pdf}.txt
   done
We can now check typos in the text files:

.. code-block:: sh

   cd ${PROJECT_LOG}
   listf=$(find ${dirpdf} -type f -name "*.pdf.txt")
   hunspell_out=${PROJECT_LOG}/hunspell_out
   hunspell_out_uniq=${PROJECT_LOG}/hunspell_out_uniq
   rm -f ${hunspell_out} ${hunspell_out_uniq}
   for onefile in ${listf}
   do
       LC_ALL=C hunspell -d en_US,nontypo_uniq --check-url -i utf-8 -l < ${onefile} >> ${hunspell_out}
   done
   sort -u ${hunspell_out} | sort --ignore-case > ${hunspell_out_uniq}
:file:`${hunspell_out_uniq}` contains:

- typos to be fixed in PDF files
- false positives to be added to a file in :file:`docs/manual/for_typo/`
- false positives to be ignored because of bad conversion from PDF to text (e.g. ligatures) or because they are too hard to add (encoding issues for Greek words, etc.)
To find one of the wrong spellings in the converted PDF files:

.. code-block:: sh

   w=infrastucture # take a real one from ${hunspell_out_uniq}
   find ${dirpdf} -type f -name "*.pdf.txt" -exec grep -Hi ${w} {} \;
Check typo in :file:`libIGCM` source files
------------------------------------------
We assume modipsl has been downloaded into :envvar:`MODIPSL`, following https://forge.ipsl.jussieu.fr/igcmg_doc/wiki/DocCinstall#Description.
We can check typos in the source files:

.. code-block:: sh

   cd ${PROJECT_LOG}
   listf=$(find ${MODIPSL} -path '*/.svn' -prune -o -type f -print)
   hunspell_out=${PROJECT_LOG}/hunspell_out
   hunspell_out_uniq=${PROJECT_LOG}/hunspell_out_uniq
   rm -f ${hunspell_out} ${hunspell_out_uniq}
   for onefile in ${listf}
   do
       LC_ALL=C hunspell -d en_US,nontypo_uniq --check-url -i utf-8 -l < ${onefile} >> ${hunspell_out}
   done
   sort -u ${hunspell_out} | sort --ignore-case > ${hunspell_out_uniq}
:file:`${hunspell_out_uniq}` contains:

- typos to be fixed in source files
- false positives to be added to a file in :file:`docs/manual/for_typo/`
- false positives to be ignored because of encoding issues
To find one of the wrong spellings in the source files:

.. code-block:: sh

   w=destionation # take a real one from ${hunspell_out_uniq}
   find ${MODIPSL} -path '*/.svn' -prune -o -type f -exec grep -Hi ${w} {} \;
.. warning::

   Corrections have to be done in the working copy and then committed.