Open Data Inventory


Submitted by Rob Davidson on Tue, 07/03/2017 - 16:45

It might be good to standardize the coding of a couple of the fields in the CSV Open Data inventory file:

Values and counts for the eligible_for_release field:

Value   Count
1         164
true      168
True     8300
TRUE       11
Y          10
Yes       386

The language field is another example.
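
A minimal sketch of the kind of normalization this would take, in Python (the filenames are assumptions; the value list comes from the counts above):

```python
import csv

# Every spelling observed in the inventory that should mean "true".
TRUTHY = {"1", "true", "y", "yes"}

with open("inventory.csv", newline="", encoding="utf-8") as src, \
     open("inventory_normalized.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        raw = (row.get("eligible_for_release") or "").strip().lower()
        row["eligible_for_release"] = "true" if raw in TRUTHY else "false"
        writer.writerow(row)
```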


Submitted by open-ouvert on Tue, 07/03/2017 - 19:29

Rob, here's the response from our systems team:

“Thank you for your comment. It was our intention to standardize this element. However, unfortunately there were issues with its implementation that prevented us from effectively doing so. Moving forward, we will work to ensure that we standardize as many elements as possible, and use controlled vocabularies where applicable. Thanks!”

Regards,
Momin, Open Government team.


Submitted by Francis on Tue, 28/03/2017 - 14:12

I noted that the Transport Canada inventory of 210 items includes a reference to one specific region's surveillance plan. Would all regional surveillance plans not be categorized as datasets?


Submitted by Stephen Russett on Wed, 29/03/2017 - 13:47

Where is the data dictionary for this dataset?

What are the differences between publisher, program alignment architecture, owner org, and owner org title?


Submitted by open-ouvert on Wed, 29/03/2017 - 14:09

Hi Stephen, the team is currently working on the data dictionary. Until then, here are the descriptions:

Publisher – the name of the organization primarily responsible for publishing the dataset at the time of publication (if applicable, i.e. if different from the current name).

program alignment architecture - The Program Alignment Architecture (PAA) is an inventory of each organization’s programs. It provides an overview of the organization’s responsibilities.

owner org – the acronym of the GC organization that uploaded the inventory

owner org title – the title of the GC organization that uploaded the inventory
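
Until the dictionary is published, here is a quick way to explore these fields, sketched in Python with pandas (the filename and the exact lower-case column names are assumptions based on the descriptions above):

```python
import pandas as pd

# Load a local copy of the inventory (filename is an assumption).
df = pd.read_csv("inventory.csv")

# How many inventory records each GC organization uploaded.
print(df.groupby(["owner_org", "owner_org_title"]).size().sort_values(ascending=False))

# Publishers recorded under a name different from the current owner organization.
renamed = df[df["publisher"].notna() & (df["publisher"] != df["owner_org_title"])]
print(renamed[["publisher", "owner_org_title"]].drop_duplicates())
```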

Hope that helps!
Momin, the Open Government team.


Submitted by Dave Sampson on Sat, 06/05/2017 - 14:46

Comments:
I wanted to use this file as a jumping-off point to investigate and process the GOC data inventory using automated tools.

Although the CSV can be handled intelligently by LibreOffice Calc, and likely by other spreadsheet software, the inventory is proving to be a pretty messy file that prevents other tools and scripting languages from consuming it, once again costing a lot of time in data cleanup.

A lot of problems can be detected just by opening the file in a simple text editor, or by running 'cat' or 'less' for those on Linux (Mac too?).

I think ensuring that some basic best practices are followed can help clean up this inventory a whole bunch.
* Following the IETF spec for CSV would be a good starting point (https://www.ietf.org/rfc/rfc4180.txt)
* Remove all carriage returns (hidden characters) often found in descriptions, to ensure that "Each record is located on a separate line, delimited by a line break (CRLF)" (IETF 4180) (767 instances of this error according to http://CSVLint.io)
* Use proper quotation; embedded quotes should be of a distinct type: "Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields." (IETF 4180)
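
As a concrete illustration of that quoting rule, Python's csv module produces RFC 4180-style output out of the box (a small sketch, not tied to any specific inventory record):

```python
import csv, io

buf = io.StringIO()
# QUOTE_ALL encloses every field in double quotes; embedded double quotes
# are escaped by doubling them, exactly as RFC 4180 section 2 describes.
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
writer.writerow(["ODI-0000-00000", 'A description with an embedded "quote"'])
print(buf.getvalue())
# "ODI-0000-00000","A description with an embedded ""quote"""
```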

The solution for cleaning up this dataset for analytical tools and scripting (especially Python): find and replace all instances of "" with ' (182 occurrences as of today).
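
A minimal sketch of that find-and-replace, assuming a local copy named inventory.csv (the filename is an assumption); note that it would also rewrite legitimately empty quoted fields, so check the result:

```python
# Swap doubled double-quotes for single quotes, as described above.
with open("inventory.csv", encoding="utf-8") as f:
    text = f.read()

print('Occurrences of "":', text.count('""'))

with open("inventory_fixed.csv", "w", encoding="utf-8") as f:
    f.write(text.replace('""', "'"))
```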

It would be great if the dataset at source could be updated so tools like RapidMiner and Orange-Canvas can use the HTTP resource directly instead of downloading it.

Although I am sure many spreadsheet users won't ever dive this deep, it would be good for GOC Open Data to show leadership in applying best practices, especially IETF RFC 4180.

Leveraging tools like https://csvlint.io can certainly help GOC's open data cause (see the report on this file below).

I have also captured the issue with Orange Canvas here so they can improve their product: https://github.com/biolab/orange3/issues/2293 ; however, it would be good to clean up this inventory as well.

Investigation Procedure:
I took a brute-force approach, since there were so many best practices that were not used in the creation of this dataset. Many tools can handle common mistakes and annoyances, but we should ensure that data is as clean as possible going in. Garbage in = garbage out.

* platform: Ubuntu Linux 16.04
* download CSV
* open in Orange Canvas (https://orange.biolab.si/); issues with Python not being able to create indices (too many indices for array)
* likely due to poor formatting of columns
* created a sample dataset using the first 100 rows, which worked, and the same with the first 150, but the first 200 and beyond failed to load
* open in RapidMiner; 8 warnings on the first 100 rows, various issues with inconsistent column formats

* open in a text editor like gedit (https://wiki.gnome.org/Apps/Gedit)
* Scroll down the first column and find many incorrect line starts (see https://www.ietf.org/rfc/rfc4180.txt)
* Tried some manual fixes, but it proved to take too long for the whole inventory:
ID, Edit
ODI-2016-00018, Removed 2 Carriage Returns (CR)
ODI-2016-00096, Removed 1 CR
ODI-2016-00190, Removed 1 CR
ODI-2016-00214, Removed 4 CR
ODI-2016-00216, Removed 2 CR
ODI-2016-00217, Removed 2 CR
ODI-2016-00219, Removed 3 CR
ODI-2016-00220, Removed 3 CR
* Tried a search and replace for carriage returns: search for '\r' and replace with nothing. But only the character was removed; records were still split across lines
* Went back to Orange Canvas and a text editor to systematically track down the problem record (row); a scripted version of this search is sketched after this list
* Fixed record 'ODI-2016-00190,' at line 202 by removing a carriage return (CR) character, which can often be found in text editors using '\r' as an escaped search pattern. No glory
* Went back to try and fix record 'ODI-2016-00018,' by removing a CR as well. No glory
* created a sample file with the first 170 rows, worked
* first 190 rows, worked
* Then started removing single rows, working backwards from ID 'ODI-2016-00192,'
* Removed 'ODI-2016-00192,'. Failed
* Removed ODI-2016-00191. Failed
* Removed ODI-2016-00190. Failed
* Removed ODI-2016-00319. Failed
* Removed ODI-2016-00189 and ODI-2016-00188 (I am getting impatient, but getting closer). Still failed
* Removed ODI-2016-00186 and ODI-2016-00187. Failed; 2 left to test before we are back to 190 records
* Removed ODI-2016-00185. Failed
* Record ODI-2016-00184 might be the issue. Let's remove that too, just to double-check against our other 190-record file. Worked
* Confirmed that record ODI-2016-00184 is giving us issues. But why?
* Read through the record; let's try some things
* Quote the second and third columns, just because that is a best practice for text. Failed
* The third column looks pretty messy. When using embedded quotes you should replace one set with single quotes; using double-double quotes is getting pretty silly and in the end caused the issue.
Original text: ,"The ""Areas of Non-Contributing Drainage within Total Gross Drainage Areas of the AAFC Watersheds Project - 2013"" dataset is a geospatial data layer containing polygon features representing the areas within the “total gross drainage areas” of each gauging station of the Agriculture and Agri-Food Canada (AAFC) Watersheds Project that DO NOT contribute to average runoff. A “total gross drainage area” is the maximum area that could contribute runoff for a single gauging station – the “areas of non-contributing drainage” are those parts of that “total gross drainage area” that DO NOT contribute to average runoff. For each “total gross drainage area” there can be none to several unconnected “areas of non-contributing drainage”. These polygons may overlap with those from other gauging stations’ “total gross drainage area”, as upstream land surfaces form part of multiple downstream gauging stations’ “total gross drainage areas”.",
Edited Text: ,"The 'Areas of Non-Contributing Drainage within Total Gross Drainage Areas of the AAFC Watersheds Project - 2013' dataset is a geospatial data layer containing polygon features representing the areas within the “total gross drainage areas” of each gauging station of the Agriculture and Agri-Food Canada (AAFC) Watersheds Project that DO NOT contribute to average runoff. A “total gross drainage area” is the maximum area that could contribute runoff for a single gauging station – the “areas of non-contributing drainage” are those parts of that “total gross drainage area” that DO NOT contribute to average runoff. For each “total gross drainage area” there can be none to several unconnected “areas of non-contributing drainage”. These polygons may overlap with those from other gauging stations “total gross drainage area”, as upstream land surfaces form part of multiple downstream gauging stations’ “total gross drainage areas”.",
Notes:
Let's try removing the quote after "stations". Failed
Let's try removing double double-quotes ("") and replacing them with single quotes. SUCCESS!!!!!
* Let's open the first 200 records and replace "" with '. Success
* Open the first 500 records and do the same; 32 instances replaced. Success
* Open the first 1000, replace all (67 instances). Success
* Try with the whole inventory, replace all (182 instances). Success!!!!!
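
Here is a scripted version of the row hunt above, so nobody has to bisect by hand: a sketch assuming the failures come from records whose field count disagrees with the header (filename is an assumption):

```python
import csv

with open("inventory.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    # Enumerate from 2 because the header occupies CSV record 1.
    for n, row in enumerate(reader, start=2):
        if len(row) != len(header):
            rec_id = row[0] if row else "?"
            print(f"record {n} ({rec_id}): {len(row)} fields, expected {len(header)}")
```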

CSV Lint (report for the inventory file: https://csvlint.io/validation/590ddb893036660004000010)
* Kept thinking I would find the issue with this file on my own, so I did not run the CSV through CSV Lint until the end
* It turns out CSVLint found 767 errors, 2 warnings and 1 message for this inventory file

Comments
* All text should be enclosed in double quotes separated by commas; the format is pretty inconsistent in this file
* Double-double quotes are the issue here; replace "" with '
* This is a pretty messy file; if some basic best practices had been followed, it could have taken less time to find the actual problem.


Submitted by open-ouvert on Mon, 08/05/2017 - 14:08

Hi Dave, thank you for your feedback. I have forwarded your comment to the team responsible for the Open Data inventories. Stay tuned for a response!

Momin, the Open Government team.


Submitted by open-ouvert on Thu, 11/05/2017 - 13:42

Dave, here's their response:

"We're looking into correcting the Content-Type header served and normalizing the embedded newlines within our generated CSV files so that automated tools like csvlint.io shows our CSV files as correct.
If your tools are having trouble processing our embedded newlines and quotes, here's an example script that downloads the dataset and removes those characters: https://gist.github.com/wardi/37e1d9922113a3252071665cda19b0b6

Thanks and I hope this helps!"
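
For readers who prefer not to run the linked gist, a separate minimal sketch of the same idea: parse the quoted embedded newlines properly, then write each record back as a single physical line. The download URL here is a placeholder, not the real resource path.

```python
import csv, io, urllib.request

URL = "https://example.org/inventory.csv"  # placeholder URL

with urllib.request.urlopen(URL) as resp:
    text = resp.read().decode("utf-8")

# csv.reader understands newlines inside quoted fields; stripping them
# from each field flattens every record onto one line.
with open("inventory_flat.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([field.replace("\r", " ").replace("\n", " ") for field in row])
```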


Submitted by Claudia on Mon, 24/07/2017 - 20:05

Hi, is it possible to access the «Nomenclature SH»? How can I map, for example, «code_SH : 500790, pays : 556, état : 1000 ... » numbers to actual country and state names? Thanks


Submitted by open-ouvert on Mon, 23/10/2017 - 14:51

Hello,

Thank you for your comment. We definitely follow the best practices we have learned from our data.gov colleagues, and have therefore decided to apply DCAT as well. We worked with them to ensure our applications aligned, and have applied a mapping to all datasets added on open.canada.ca. You can find this mapping on the right-hand side of every dataset record (see the JSON and XML links on the right).
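
For example, here is a sketch of fetching that JSON rendition programmatically; the dataset id is a placeholder, and the endpoint pattern is an assumption based on the portal exposing a CKAN-style API:

```python
import json, urllib.request

DATASET_ID = "your-dataset-id"  # placeholder
url = f"https://open.canada.ca/data/en/api/3/action/package_show?id={DATASET_ID}"

with urllib.request.urlopen(url) as resp:
    record = json.load(resp)["result"]

print(record["title"])
for resource in record.get("resources", []):
    print("-", resource.get("format"), resource.get("url"))
```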

Regards,

Momin
The Open Government team


Submitted by Gilles Girard on Thu, 15/11/2018 - 14:44

Dave Sampson, on Sat, 06/05/2017, expressed multiple issues, and the answer was a workaround for one of many issues.
Focusing on Open Data Inventory.csv, I find cells with too much data.
On the line with ref_number ODI-2016-00133, column description_en, there are 59 lines and 480 words. The formatting looks like a copy-and-paste from a word processor such as Word. I see traces of paragraphs, line numbering, tabs and line breaks.
The essence of CSV files is to load the dataset into different systems (database, spreadsheet, etc.) to evaluate the content. The raw data needs to be solid.
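
A small sketch of how such cells can be flagged programmatically (the filename and the ref_number/description_en column names are taken from this thread):

```python
import csv

# Flag description_en cells with many embedded line breaks or unusual
# length, like the ODI-2016-00133 record described above.
with open("Open Data Inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        desc = row.get("description_en") or ""
        lines, words = desc.count("\n") + 1, len(desc.split())
        if lines > 10 or words > 400:
            print(row.get("ref_number"), f"{lines} lines, {words} words")
```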
Ultimately I was looking for a dataset on IT equipment to validate what I have. The only references in the document regarding IT equipment are two lines limited to 2 departments, with no URL to the actual data.

Is there a clean source of data focusing on the IT equipment of all departments?
