27. Reading and Writing Data in Pandas
By Bernd Klein. Last modified: 01 Feb 2022.
All the powerful data structures like the Series and the DataFrames would be of little use if the Pandas module didn't provide powerful functionality for reading in and writing out data. It is not only a matter of having functions for interacting with files. To be useful to data scientists it also needs functions which support the most important data formats, like
- Delimiter-separated files, e.g. csv
- Microsoft Excel files
- HTML
- XML
- JSON
Delimiter-separated Values
Most people take csv files as a synonym for delimiter-separated values files. They leave out of account the fact that csv is an acronym for "comma-separated values", which is not the case in many situations. Pandas also uses "csv" in contexts in which "dsv" would be more appropriate.
Delimiter-separated values (DSV) are defined and stored as two-dimensional arrays (for example strings) of data by separating the values in each row with delimiter characters defined for this purpose. This way of implementing data is often used in combination with spreadsheet programs, which can read in and write out data as DSV. They are also used as a general data exchange format.
We call a text file a "delimited text file" if it contains text in DSV format.
For instance, the file dollar_euro.txt is a delimited text file and uses tabs (\t) as delimiters.
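Whether Pandas parses such a file correctly hinges on the `sep` argument. As a quick sketch — using an in-memory `StringIO` buffer with made-up sample data instead of the actual dollar_euro.txt file — reading tab-separated text looks like this:

```python
import io

import pandas as pd

# A small tab-delimited snippet standing in for a file like dollar_euro.txt
data = "Year\tAverage\n2016\t0.901696\n2015\t0.901896\n"

# StringIO makes the string behave like a file object
df = pd.read_csv(io.StringIO(data), sep="\t")
print(df)
```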
Reading CSV and DSV Files
Pandas offers two ways to read in CSV or DSV files, to be precise:
- DataFrame.from_csv
- read_csv
There is no big difference between those two functions, but they have different default values in some cases and read_csv has more parameters. We will focus on read_csv, because DataFrame.from_csv is kept within Pandas only for reasons of backwards compatibility and is deprecated in newer versions.
import pandas as pd

exchange_rates = pd.read_csv("/data1/dollar_euro.txt", sep="\t")
print(exchange_rates)
OUTPUT:
    Year   Average  Min USD/EUR  Max USD/EUR  Working days
0   2016  0.901696     0.864379     0.959785           247
1   2015  0.901896     0.830358     0.947688           256
2   2014  0.753941     0.716692     0.823655           255
3   2013  0.753234     0.723903     0.783208           255
4   2012  0.778848     0.743273     0.827198           256
5   2011  0.719219     0.671953     0.775855           257
6   2010  0.755883     0.686672     0.837381           258
7   2009  0.718968     0.661376     0.796495           256
8   2008  0.683499     0.625391     0.802568           256
9   2007  0.730754     0.672314     0.775615           255
10  2006  0.797153     0.750131     0.845594           255
11  2005  0.805097     0.740357     0.857118           257
12  2004  0.804828     0.733514     0.847314           259
13  2003  0.885766     0.791766     0.963670           255
14  2002  1.060945     0.953562     1.165773           255
15  2001  1.117587     1.047669     1.192748           255
16  2000  1.085899     0.962649     1.211827           255
17  1999  0.939475     0.848176     0.998502           261
As we can see, read_csv automatically used the first line as the names for the columns. It is possible to give other names to the columns. For this purpose, we have to skip the first line by setting the parameter "header" to 0 and we have to assign a list with the column names to the parameter "names":
import pandas as pd

exchange_rates = pd.read_csv("/data1/dollar_euro.txt", sep="\t",
                             header=0,
                             names=["year", "min", "max", "days"])
print(exchange_rates)
OUTPUT:
      year       min       max  days
2016  0.901696  0.864379  0.959785   247
2015  0.901896  0.830358  0.947688   256
2014  0.753941  0.716692  0.823655   255
2013  0.753234  0.723903  0.783208   255
2012  0.778848  0.743273  0.827198   256
2011  0.719219  0.671953  0.775855   257
2010  0.755883  0.686672  0.837381   258
2009  0.718968  0.661376  0.796495   256
2008  0.683499  0.625391  0.802568   256
2007  0.730754  0.672314  0.775615   255
2006  0.797153  0.750131  0.845594   255
2005  0.805097  0.740357  0.857118   257
2004  0.804828  0.733514  0.847314   259
2003  0.885766  0.791766  0.963670   255
2002  1.060945  0.953562  1.165773   255
2001  1.117587  1.047669  1.192748   255
2000  1.085899  0.962649  1.211827   255
1999  0.939475  0.848176  0.998502   261
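read_csv can also restrict which columns get loaded at all, via the "usecols" parameter. A minimal sketch with an in-memory buffer of made-up data (not the tutorial's file):

```python
import io

import pandas as pd

# Hypothetical tab-delimited sample data
data = "year\tmin\tmax\tdays\n2016\t0.864379\t0.959785\t247\n"

# usecols loads only the named columns, in file order
subset = pd.read_csv(io.StringIO(data), sep="\t", usecols=["year", "days"])
print(subset)
```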
Exercise 1
The file "countries_population.csv" is a csv file, containing the population numbers of all countries (July 2014). The delimiter of the file is a space, and commas are used to separate groups of thousands in the numbers. The method 'head(n)' of a DataFrame can be used to output only the first n rows or lines. Read the file into a DataFrame.
Solution:
pop = pd.read_csv("/data1/countries_population.csv",
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",
                  sep=" ",
                  thousands=",")
print(pop.head(5))
OUTPUT:
                Population
Country
China           1355692576
India           1236344631
European Union   511434812
United States    318892103
Indonesia        253609643
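The interplay of "quotechar", "sep" and "thousands" in the solution above can be sketched with a tiny in-memory sample (made-up data, standing in for countries_population.csv):

```python
import io

import pandas as pd

# Two quoted country names, space-separated, with "," grouping thousands
data = "'United States' 318,892,103\n'Indonesia' 253,609,643\n"

pop = pd.read_csv(io.StringIO(data),
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",   # the quoted field may contain the separator
                  sep=" ",
                  thousands=",")   # strips the "," so the column parses as integers
print(pop)
```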
Writing csv Files
We can create csv (or dsv) files with the method "to_csv". Before we do this, we will prepare some data to output, which we will write to a file. We have two csv files with population data for various countries. countries_male_population.csv contains the figures of the male populations and countries_female_population.csv correspondingly the numbers for the female populations. We will create a new csv file with the sum:
column_names = ["Country"] + list(range(2002, 2013))
male_pop = pd.read_csv("/data1/countries_male_population.csv",
                       header=None, index_col=0, names=column_names)
female_pop = pd.read_csv("/data1/countries_female_population.csv",
                         header=None, index_col=0, names=column_names)
population = male_pop + female_pop
2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | |
---|---|---|---|---|---|---|---|---|---|---|---|
Country | |||||||||||
Australia | 19640979.0 | 19872646 | 20091504 | 20339759 | 20605488 | 21015042 | 21431781 | 21874920 | 22342398 | 22620554 | 22683573 |
Austria | 8139310.0 | 8067289 | 8140122 | 8206524 | 8265925 | 8298923 | 8331930 | 8355260 | 8375290 | 8404252 | 8443018 |
Belgium | 10309725.0 | 10355844 | 10396421 | 10445852 | 10511382 | 10584534 | 10666866 | 10753080 | 10839905 | 10366843 | 11035958 |
Canada | NaN | 31361611 | 31372587 | 31989454 | 32299496 | 32649482 | 32927372 | 33327337 | 33334414 | 33927935 | 34492645 |
Czech Republic | 10269726.0 | 10203269 | 10211455 | 10220577 | 10251079 | 10287189 | 10381130 | 10467542 | 10506813 | 10532770 | 10505445 |
Denmark | 5368354.0 | 5383507 | 5397640 | 5411405 | 5427459 | 5447084 | 5475791 | 5511451 | 5534738 | 5560628 | 5580516 |
Finland | 5194901.0 | 5206295 | 5219732 | 5236611 | 5255580 | 5276955 | 5300484 | 5326314 | 5351427 | 5375276 | 5401267 |
France | 59337731.0 | 59630121 | 59900680 | 62518571 | 62998773 | 63392140 | 63753140 | 64366962 | 64716310 | 65129746 | 65394283 |
Germany | 82440309.0 | 82536680 | 82531671 | 82500849 | 82437995 | 82314906 | 82217837 | 82002356 | 81802257 | 81751602 | 81843743 |
Greece | 10988000.0 | 11006377 | 11040650 | 11082751 | 11125179 | 11171740 | 11213785 | 11260402 | 11305118 | 11309885 | 11290067 |
Hungary | 10174853.0 | 10142362 | 10116742 | 10097549 | 10076581 | 10066158 | 10045401 | 10030975 | 10014324 | 9985722 | 9957731 |
Iceland | 286575.0 | 288471 | 290570 | 293577 | 299891 | 307672 | 315459 | 319368 | 317630 | 318452 | 319575 |
Ireland | 3882683.0 | 3963636 | 4027732 | 4109173 | 4209019 | 4239848 | 4401335 | 4450030 | 4467854 | 4569864 | 4582769 |
Italy | 56993742.0 | 57321070 | 57888245 | 58462375 | 58751711 | 59131287 | 59619290 | 60045068 | 60340328 | 60626442 | 60820696 |
Japan | 127291000.0 | 127435000 | 127620000 | 127687000 | 127767994 | 127770000 | 127771000 | 127692000 | 127510000 | 128057000 | 127799000 |
Korea | 47639618.0 | 47925318 | 48082163 | 48138077 | 48297184 | 48456369 | 48606787 | 48746693 | 48874539 | 49779440 | 50004441 |
Luxembourg | 444050.0 | 448300 | 451600 | 455000 | 469086 | 476187 | 483799 | 493500 | 502066 | 511840 | 524853 |
Mexico | 101826249.0 | 103039964 | 104213503 | 103001871 | 103946866 | 104874282 | 105790725 | 106682518 | 107550697 | 108396211 | 115682867 |
Netherlands | 16105285.0 | 16192572 | 16258032 | 16305526 | 16334210 | 16357992 | 16405399 | 16485787 | 16574989 | 16655799 | 16730348 |
New Zealand | 3939130.0 | 4009200 | 4062500 | 4100570 | 4139470 | 4228280 | 4268880 | 4315840 | 4367740 | 4405150 | 4433100 |
Norway | 4524066.0 | 4552252 | 4577457 | 4606363 | 4640219 | 4681134 | 4737171 | 4799252 | 4858199 | 4920305 | 4985870 |
Poland | 38632453.0 | 38218531 | 38190608 | 38173835 | 38157055 | 38125479 | 38115641 | 38135876 | 38167329 | 38200037 | 38538447 |
Portugal | 10335559.0 | 10407465 | 10474685 | 10529255 | 10569592 | 10599095 | 10617575 | 10627250 | 10637713 | 10636979 | 10542398 |
Slovak Republic | 5378951.0 | 5379161 | 5380053 | 5384822 | 5389180 | 5393637 | 5400998 | 5412254 | 5424925 | 5435273 | 5404322 |
Spain | 40409330.0 | 41550584 | 42345342 | 43038035 | 43758250 | 44474631 | 45283259 | 45828172 | 45989016 | 46152926 | 46818221 |
Sweden | 8909128.0 | 8940788 | 8975670 | 9011392 | 9047752 | 9113257 | 9182927 | 9256347 | 9340682 | 9415570 | 9482855 |
Switzerland | 7261210.0 | 7313853 | 7364148 | 7415102 | 7459128 | 7508739 | 7593494 | 7701856 | 7785806 | 7870134 | 7954662 |
Turkey | NaN | 70171979 | 70689500 | 71607500 | 72519974 | 72519974 | 70586256 | 71517100 | 72561312 | 73722988 | 74724269 |
United Kingdom | 58706905.0 | 59262057 | 59699828 | 60059858 | 60412870 | 60781346 | 61179260 | 61595094 | 62026962 | 62498612 | 63256154 |
United States | 277244916.0 | 288774226 | 290810719 | 294442683 | 297308143 | 300184434 | 304846731 | 305127551 | 307756577 | 309989078 | 312232049 |
population.to_csv("/data1/countries_total_population.csv")
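to_csv offers the same kind of delimiter control as read_csv; and when no target path is given, it returns the text as a string, which makes for an easy sketch (with a small made-up frame, not the population data):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Austria", "Belgium"],
                   "2012": [8443018, 11035958]}).set_index("Country")

# With no target path, to_csv returns the CSV text as a string
csv_text = df.to_csv()
# A semicolon-delimited (dsv) variant of the same data
dsv_text = df.to_csv(sep=";")
print(dsv_text)
```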
We want to create a new DataFrame with all the information, i.e. female, male and complete population. This means that we have to introduce a hierarchical index. Before we do it on our DataFrame, we will introduce this problem in a simple example:
import pandas as pd

shop1 = {"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}}
shop2 = {"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}}
shop1 = pd.DataFrame(shop1)
shop2 = pd.DataFrame(shop2)
both_shops = shop1 + shop2
print("Sales of shop1:\n", shop1)
print("\nSales of both shops\n", both_shops)
OUTPUT:
Sales of shop1:
       foo  bar
2010   23   13
2011   25   29

Sales of both shops
       foo  bar
2010  246  226
2011  250  258
shops = pd.concat([shop1, shop2], keys=["one", "two"])
shops
foo | bar | ||
---|---|---|---|
one | 2010 | 23 | 13 |
2011 | 25 | 29 | |
two | 2010 | 223 | 213 |
2011 | 225 | 229 |
We want to swap the hierarchical indices. For this we will use 'swaplevel':
shops.swaplevel()
shops.sort_index(inplace=True)
shops
foo | bar | ||
---|---|---|---|
one | 2010 | 23 | 13 |
2011 | 25 | 29 | |
two | 2010 | 223 | 213 |
2011 | 225 | 229 |
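Note that swaplevel returns a new DataFrame rather than modifying the original in place, which is why the output above still shows the original level order. A sketch of the variant that keeps the swap, reusing the shop data from above:

```python
import pandas as pd

shop1 = pd.DataFrame({"foo": {2010: 23, 2011: 25},
                      "bar": {2010: 13, 2011: 29}})
shop2 = pd.DataFrame({"foo": {2010: 223, 2011: 225},
                      "bar": {2010: 213, 2011: 229}})

shops = pd.concat([shop1, shop2], keys=["one", "two"])

# swaplevel returns a new object; bind the result to keep the swapped order
swapped = shops.swaplevel().sort_index()
print(swapped)
```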
We will now get back to our initial problem with the population figures. We will apply the same steps to those DataFrames:
pop_complete = pd.concat([population.T, male_pop.T, female_pop.T],
                         keys=["total", "male", "female"])
df = pop_complete.swaplevel()
df.sort_index(inplace=True)
df[["Austria", "Australia", "France"]]
Country | Austria | Australia | France |
---|---|---|---|---|
2002 | female | 4179743.0 | 9887846.0 | 30510073.0 |
male | 3959567.0 | 9753133.0 | 28827658.0 | |
total | 8139310.0 | 19640979.0 | 59337731.0 |
2003 | female | 4158169.0 | 9999199.0 | 30655533.0 |
male | 3909120.0 | 9873447.0 | 28974588.0 | |
total | 8067289.0 | 19872646.0 | 59630121.0 | |
2004 | female | 4190297.0 | 10100991.0 | 30789154.0 |
male | 3949825.0 | 9990513.0 | 29111526.0 | |
total | 8140122.0 | 20091504.0 | 59900680.0 | |
2005 | female | 4220228.0 | 10218321.0 | 32147490.0 |
male | 3986296.0 | 10121438.0 | 30371081.0 | |
total | 8206524.0 | 20339759.0 | 62518571.0 | |
2006 | female | 4246571.0 | 10348070.0 | 32390087.0 |
male | 4019354.0 | 10257418.0 | 30608686.0 | |
total | 8265925.0 | 20605488.0 | 62998773.0 | |
2007 | female | 4261752.0 | 10570420.0 | 32587979.0 |
male | 4037171.0 | 10444622.0 | 30804161.0 | |
total | 8298923.0 | 21015042.0 | 63392140.0 | |
2008 | female | 4277716.0 | 10770864.0 | 32770860.0 |
male | 4054214.0 | 10660917.0 | 30982280.0 | |
total | 8331930.0 | 21431781.0 | 63753140.0 | |
2009 | female | 4287213.0 | 10986535.0 | 33208315.0 |
male | 4068047.0 | 10888385.0 | 31158647.0 | |
total | 8355260.0 | 21874920.0 | 64366962.0 | |
2010 | female | 4296197.0 | 11218144.0 | 33384930.0 |
male | 4079093.0 | 11124254.0 | 31331380.0 | |
total | 8375290.0 | 22342398.0 | 64716310.0 | |
2011 | female | 4308915.0 | 11359807.0 | 33598633.0 |
male | 4095337.0 | 11260747.0 | 31531113.0 | |
total | 8404252.0 | 22620554.0 | 65129746.0 | |
2012 | female | 4324983.0 | 11402769.0 | 33723892.0 |
male | 4118035.0 | 11280804.0 | 31670391.0 | |
total | 8443018.0 | 22683573.0 | 65394283.0 |
df.to_csv("/data1/countries_total_population.csv")
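To read such a file back with its hierarchical index intact, read_csv accepts a list for "index_col". A sketch with an in-memory buffer shaped like the file written above (the column names "year" and "kind" are illustrative assumptions, not the actual file's headers):

```python
import io

import pandas as pd

# CSV text with two leading index columns, mimicking the written file
data = ("year,kind,Austria\n"
        "2002,female,4179743\n"
        "2002,male,3959567\n"
        "2002,total,8139310\n")

# index_col=[0, 1] rebuilds the two-level hierarchical index
df_back = pd.read_csv(io.StringIO(data), index_col=[0, 1])
print(df_back)
```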
Exercise 2
- Read in the dsv file (csv) bundeslaender.txt. Create a new file with the columns 'land', 'area', 'female', 'male', 'population' and 'density' (inhabitants per square kilometre).
- Print out the rows where the area is greater than 30000 and the population is greater than 10000.
- Print the rows where the density is greater than 300.
lands = pd.read_csv('/data1/bundeslaender.txt', sep=" ")
print(lands.columns.values)
OUTPUT:
['land' 'area' 'male' 'female']
# swap the columns of our DataFrame:
lands = lands.reindex(columns=['land', 'area', 'female', 'male'])
lands[:2]
land | area | female | male |
---|---|---|---|---|
0 | Baden-Württemberg | 35751.65 | 5465 | 5271 |
1 | Bayern | 70551.57 | 6366 | 6103 |
lands.insert(loc=len(lands.columns), column='population',
             value=lands['female'] + lands['male'])
land | area | female | male | population |
---|---|---|---|---|---|
0 | Baden-Württemberg | 35751.65 | 5465 | 5271 | 10736 |
1 | Bayern | 70551.57 | 6366 | 6103 | 12469 |
2 | Berlin | 891.85 | 1736 | 1660 | 3396 |
lands.insert(loc=len(lands.columns), column='density',
             value=(lands['population'] * 1000 / lands['area']).round(0))
lands[:4]
land | area | female | male | population | density | |
---|---|---|---|---|---|---|
0 | Baden-Württemberg | 35751.65 | 5465 | 5271 | 10736 | 300.0 |
1 | Bayern | 70551.57 | 6366 | 6103 | 12469 | 177.0 |
2 | Berlin | 891.85 | 1736 | 1660 | 3396 | 3808.0 |
3 | Brandenburg | 29478.61 | 1293 | 1267 | 2560 | 87.0 |
print(lands.loc[(lands.area > 30000) & (lands.population > 10000)])
OUTPUT:
                  land      area  female  male  population  density
0    Baden-Württemberg  35751.65    5465  5271       10736    300.0
1               Bayern  70551.57    6366  6103       12469    177.0
9  Nordrhein-Westfalen  34085.29    9261  8797       18058    530.0
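The same row selection can also be written with DataFrame.query, which some find more readable than chained boolean masks. A sketch on a small stand-in frame with made-up figures (not the full bundeslaender data):

```python
import pandas as pd

lands = pd.DataFrame({"land": ["Baden-Württemberg", "Berlin"],
                      "area": [35751.65, 891.85],
                      "population": [10736, 3396]})

# Equivalent to lands.loc[(lands.area > 30000) & (lands.population > 10000)]
big = lands.query("area > 30000 and population > 10000")
print(big)
```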
Reading and Writing Excel Files
It is also possible to read and write Microsoft Excel files. The Pandas functionality to read and write Excel files uses the modules 'xlrd' and 'openpyxl'. These modules are not automatically installed by Pandas, so you may have to install them manually!
We will use a simple Excel document to demonstrate the reading capabilities of Pandas. The document sales.xls contains two sheets, one called 'week1' and the other one 'week2'.
An Excel file can be read in with the Pandas function "read_excel". This is demonstrated in the following example Python code:
excel_file = pd.ExcelFile("/data1/sales.xls")
sheet = pd.read_excel(excel_file)
sheet
Weekday | Sales | |
---|---|---|
0 | Monday | 123432.980000 |
1 | Tuesday | 122198.650200 |
2 | Wednesday | 134418.515220 |
3 | Thursday | 131730.144916 |
4 | Friday | 128173.431003 |
The document "sales.xls" contains two sheets, but we have only been able to read in the first one with "read_excel". A complete Excel document, which can consist of an arbitrary number of sheets, can be completely read in like this:
docu = {}
for sheet_name in excel_file.sheet_names:
    docu[sheet_name] = excel_file.parse(sheet_name)

for sheet_name in docu:
    print("\n" + sheet_name + ":\n", docu[sheet_name])
OUTPUT:
week1:
      Weekday          Sales
0     Monday  123432.980000
1    Tuesday  122198.650200
2  Wednesday  134418.515220
3   Thursday  131730.144916
4     Friday  128173.431003

week2:
      Weekday          Sales
0     Monday  223277.980000
1    Tuesday  234441.879000
2  Wednesday  246163.972950
3   Thursday  241240.693491
4     Friday  230143.621590
We will now calculate the average sales numbers of the two weeks:
average = docu["week1"].copy()
average["Sales"] = (docu["week1"]["Sales"] + docu["week2"]["Sales"]) / 2
print(average)
OUTPUT:
      Weekday          Sales
0     Monday  173355.480000
1    Tuesday  178320.264600
2  Wednesday  190291.244085
3   Thursday  186485.419203
4     Friday  179158.526297
We will save the DataFrame 'average' in a new document, with 'week1' and 'week2' as additional sheets as well:
writer = pd.ExcelWriter('/data1/sales_average.xlsx')
docu['week1'].to_excel(writer, 'week1')
docu['week2'].to_excel(writer, 'week2')
average.to_excel(writer, 'average')
writer.save()
writer.close()
Source: https://python-course.eu/numerical-programming/reading-and-writing-data-in-pandas.php