Typo
====
Typos may be hard to find in all source files (code, documentation, tools).
:command:`hunspell` helps to detect those mistakes.
:command:`pylint` can also be used for Python sources.
Both tools rely on dictionaries providing lists of natural-language words.

Here is the flow:

.. blockdiag:: typo_blockdiag.dot
Missing words in dictionaries by categories
-------------------------------------------
System dictionaries lack some scientific words (e.g. barocline), computer language reserved words, acronyms (e.g. IGCMG) and code variable names.
Those missing words can be listed in dedicated files.
In the directory :file:`docs/manual/for_typo/`, each :file:`*.txt` file contains the words of one specific category.
For example, missing scientific words can be added to :file:`jargon.txt`.
There is also a file :file:`type.aff` which will be used by :command:`hunspell`.
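For reference, an affix file for a plain supplemental word list can be almost empty. The sketch below shows what such a minimal file may contain; this is an assumption about the hunspell affix format in general, not the content of the project's actual file:

.. code-block:: none

   # Minimal hunspell affix file: only declares the dictionary encoding;
   # no affix rules are needed for a plain word list.
   SET UTF-8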
Build one supplemental dictionary
---------------------------------
Supplemental dictionaries can be merged into a single one and added to the list of dictionaries used by :command:`hunspell` during a spelling check.
They must be encoded in UTF-8.

.. code-block:: sh

   listf=$(find ${PROJECT}/docs/manual/for_typo -name "*.txt")
   nontypo=${PROJECT_LOG}/nontypo
   nontypo_uniq=${PROJECT_LOG}/nontypo_uniq
   rm -f ${nontypo} ${nontypo_uniq}
   for onefile in ${listf}
   do
       cat ${onefile} >> ${nontypo}
   done
   sort -u ${nontypo} | sort --ignore-case > ${nontypo_uniq}
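The two-step sort above first removes exact duplicates, then orders the result case-insensitively. A minimal sketch of that behaviour on a hypothetical word list (``LC_ALL=C`` is forced here only to make the ordering reproducible):

.. code-block:: sh

   printf 'Beta\nalpha\nbeta\nalpha\n' > words
   # first pass: byte-wise sort with exact duplicates removed;
   # second pass: case-insensitive ordering of the remaining words
   LC_ALL=C sort -u words | LC_ALL=C sort --ignore-case
   # prints alpha, Beta, beta (one word per line)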
The list :file:`${nontypo_uniq}` can also be used to check for typos in documentation files and source code.
First, turn the word list into a :file:`.dic` file usable by :command:`hunspell` (i.e. add the number of words on the first line):

.. code-block:: sh

   nontypo_uniq_dic=${PROJECT_LOG}/nontypo_uniq.dic
   linecount=$(wc -l < ${nontypo_uniq})
   sed "1i ${linecount}" ${nontypo_uniq} > ${nontypo_uniq_dic}
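As a concrete sketch of the resulting :file:`.dic` layout, using a hypothetical three-word list: the first line holds the word count, each following line holds one word.

.. code-block:: sh

   printf 'barocline\nIGCMG\nlibIGCM\n' > mini_list
   # prepend the word count, as hunspell .dic files require
   linecount=$(wc -l < mini_list)
   sed "1i ${linecount}" mini_list > mini_list.dic
   cat mini_list.dic
   # prints 3, barocline, IGCMG, libIGCM (one item per line)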
The associated :file:`nontypo_uniq.aff` file already exists in :file:`${PROJECT}/docs/manual/for_typo`:

.. code-block:: sh

   ln -s ${PROJECT}/docs/manual/for_typo/nontypo_uniq.aff ${PROJECT_LOG}/nontypo_uniq.aff

Now we have :file:`${PROJECT_LOG}/nontypo_uniq.dic` and :file:`${PROJECT_LOG}/nontypo_uniq.aff` usable by :command:`hunspell`.
Check typo in files
-------------------

.. todo:: find why the ``-p`` option of :command:`hunspell` does not work; for now, the command has to be executed in :file:`${PROJECT_LOG}`, where the :file:`.dic` and :file:`.aff` files are located.
Check typo in wiki pages
------------------------
Until :ref:`tracwiki_migration` is achieved, we have to check typos in the HTML files produced by Trac on http://forge.ipsl.jussieu.fr/igcmg_doc/wiki/.
By convention, all |igcmg_doc| page names start with Doc, but there are exceptions like Train, WikiStart, etc.
To get the URIs of all hand-written wiki pages of http://forge.ipsl.jussieu.fr/igcmg_doc/ [1]_:

.. code-block:: sh

   excluded_href=DocYgraphvizLibigcmprod
   list_href=${PROJECT_LOG}/list_href
   xsltproc --novalid \
       ${PROJECT}/docs/manual/for_tracwiki/titleindex.xsl \
       http://forge.ipsl.jussieu.fr/igcmg_doc/wiki/TitleIndex | \
       grep "/igcmg_doc/wiki/" | grep -v "?action" | grep -v "?format" | \
       grep -v ${excluded_href} | sort -u | \
       sed -e "s@^@http://forge.ipsl.jussieu.fr@" > ${list_href}
   trac_uri=${PROJECT_LOG}/trac_uri
   sed -e "s@^@http://forge.ipsl.jussieu.fr/igcmg_doc/wiki/@" \
       ${PROJECT}/docs/manual/for_tracwiki/tracpages.txt | sort -u > ${trac_uri}
   list_uri=$(comm -13 ${trac_uri} ${list_href})

.. [1] We exclude DocYgraphvizLibigcmprod because Trac detects an internal error on this page (an issue with the graphviz Trac plugins).
To download those URIs locally:

.. code-block:: sh

   dirhtml=${PROJECT_LOG}/html/
   rm -fr ${dirhtml}
   mkdir ${dirhtml}
   for uri in ${list_uri}
   do
       wget -P ${dirhtml} ${uri}
   done
We can now check typos in the HTML files:

.. code-block:: sh

   cd ${PROJECT_LOG}
   listf=$(find ${dirhtml} -type f)
   hunspell_out=${PROJECT_LOG}/hunspell_out
   hunspell_out_uniq=${PROJECT_LOG}/hunspell_out_uniq
   rm -f ${hunspell_out} ${hunspell_out_uniq}
   for onefile in ${listf}
   do
       LC_ALL=C hunspell -d en_US,nontypo_uniq --check-url -i utf-8 -l < ${onefile} >> ${hunspell_out}
   done
   sort -u ${hunspell_out} | sort --ignore-case > ${hunspell_out_uniq}
.. warning::

   Beware of writing ``LC_ALL=C;hunspell ...`` (with a semicolon): it changes ``LC_ALL`` for the rest of the shell session. The prefix form ``LC_ALL=C hunspell ...`` applies only to that single command.
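The difference between the two forms can be seen with any command; a minimal sketch:

.. code-block:: sh

   unset LC_ALL
   # prefix form: LC_ALL=C applies only to this one command...
   LC_ALL=C sh -c 'echo "inside: ${LC_ALL}"'
   # ...and is not set in the current shell afterwards
   echo "after: ${LC_ALL:-unset}"
   # prints inside: C
   #        after: unset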
:file:`${hunspell_out_uniq}` contains:

- typos to be fixed in wiki pages
- false positives to be added to a file in :file:`docs/manual/for_typo/`
- false positives to be ignored because they are too hard to add (encoding issues for Greek words, etc.)
.. warning::

   The following command prints only the lines present in both :file:`${hunspell_out_uniq}` and :file:`${nontypo_uniq}`:

   .. code-block:: sh

      comm --nocheck-order -12 ${hunspell_out_uniq} ${nontypo_uniq}

   If the output is not empty, the supplemental dictionary has not been used by :command:`hunspell`.
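:command:`comm` compares two sorted files line by line; ``-12`` suppresses the lines unique to each file, leaving only the common ones. A toy sketch with hypothetical lists:

.. code-block:: sh

   printf 'alpha\nbeta\ngamma\n' > list_a
   printf 'beta\ndelta\n' > list_b
   # keep only the lines present in both (sorted) files
   comm -12 list_a list_b
   # prints beta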
.. todo:: give some ideas (LC, ?)
To find one of the wrong spellings in the downloaded HTML pages:

.. code-block:: sh

   w=amonch # take a real one from ${hunspell_out_uniq}
   find ${dirhtml} -type f -exec grep -Hi ${w} {} \;
.. note::

   It is also possible to find wrong spellings via the search facility of the Trac interface, but results may differ (case sensitivity, Trac plugins).
.. warning::

   Corrections have to be done via the wiki interface of the forge.
Check typo in attached PDF on wiki pages
----------------------------------------
To get all the URIs of the attached files:

.. code-block:: sh

   listf=$(find ${dirhtml} -type f)
   list_attached=${PROJECT_LOG}/list_attached
   list=${PROJECT_LOG}/list
   rm -f ${list_attached} ${list}
   for onefile in ${listf}
   do
       xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t \
           -m "//x:a[@title='Download']" \
           -v "concat('http://forge.ipsl.jussieu.fr/',@href)" -n ${onefile} >> ${list}
   done
   sort -u ${list} > ${list_attached}
To isolate the PDF files among these URIs:

.. code-block:: sh

   list_pdf=$(grep "\.pdf$" ${list_attached})
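A sketch of this filter on a hypothetical list of attachment URIs; the anchored pattern keeps only the entries ending in ``.pdf``:

.. code-block:: sh

   printf 'wiki/doc.pdf\nwiki/logo.png\nwiki/slides.pdf\n' > attached
   # "$" anchors the match at the end of the line
   grep "\.pdf$" attached
   # prints wiki/doc.pdf
   #        wiki/slides.pdf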
To download those URIs locally:

.. code-block:: sh

   dirpdf=${PROJECT_LOG}/pdf/
   rm -rf ${dirpdf}
   for uri in ${list_pdf}
   do
       wget -P ${dirpdf} ${uri}
   done
We can now convert these PDF files to text files:

.. code-block:: sh

   list_pdf=$(find ${dirpdf} -type f)
   for pdf in ${list_pdf}
   do
       pdftotext ${pdf} ${pdf}.txt
   done
We can now check typos in the text files:

.. code-block:: sh

   cd ${PROJECT_LOG}
   listf=$(find ${dirpdf} -type f -name "*.pdf.txt")
   hunspell_out=${PROJECT_LOG}/hunspell_out
   hunspell_out_uniq=${PROJECT_LOG}/hunspell_out_uniq
   rm -f ${hunspell_out} ${hunspell_out_uniq}
   for onefile in ${listf}
   do
       LC_ALL=C hunspell -d en_US,nontypo_uniq --check-url -i utf-8 -l < ${onefile} >> ${hunspell_out}
   done
   sort -u ${hunspell_out} | sort --ignore-case > ${hunspell_out_uniq}
:file:`${hunspell_out_uniq}` contains:

- typos to be fixed in PDF files
- false positives to be added to a file in :file:`docs/manual/for_typo/`
- false positives to be ignored because of bad conversion from PDF to text (e.g. ligatures) or because they are too hard to add (encoding issues for Greek words, etc.)
To find one of the wrong spellings in the converted PDF files:

.. code-block:: sh

   w=infrastucture # take a real one from ${hunspell_out_uniq}
   find ${dirpdf} -type f -name "*.pdf.txt" -exec grep -Hi ${w} {} \;
Check typo in :file:`libIGCM` source files
------------------------------------------
We assume modipsl has been downloaded into :envvar:`MODIPSL`, following https://forge.ipsl.jussieu.fr/igcmg_doc/wiki/DocCinstall#Description.
We can check typos in the source files:

.. code-block:: sh

   cd ${PROJECT_LOG}
   listf=$(find ${MODIPSL} -path '*/.svn' -prune -o -type f -print)
   hunspell_out=${PROJECT_LOG}/hunspell_out
   hunspell_out_uniq=${PROJECT_LOG}/hunspell_out_uniq
   rm -f ${hunspell_out} ${hunspell_out_uniq}
   for onefile in ${listf}
   do
       LC_ALL=C hunspell -d en_US,nontypo_uniq --check-url -i utf-8 -l < ${onefile} >> ${hunspell_out}
   done
   sort -u ${hunspell_out} | sort --ignore-case > ${hunspell_out_uniq}
:file:`${hunspell_out_uniq}` contains:

- typos to be fixed in source files
- false positives to be added to a file in :file:`docs/manual/for_typo/`
- false positives to be ignored because of encoding issues
To find one of the wrong spellings in the source files:

.. code-block:: sh

   w=destionation # take a real one from ${hunspell_out_uniq}
   find ${MODIPSL} -path '*/.svn' -prune -o -type f -exec grep -Hi ${w} {} \;
.. warning::

   Corrections have to be done in the working copy and then committed.