Big Data for Science (and other) Students (rpruim/talks/BigData/JSM2013/BigDataPanel.pdf)

Where are we now? What is Big Data? Some First Steps Questions

Big Data for Science (and other) Students

Randall Pruim

Calvin College

Questions to My Colleagues

1. What is the largest data set that students encounter in your respective majors?

2. In the context of your disciplines, when someone says “big data”, what do you think of? How big? What application areas?

A Chemist Responds

Me: How big is your data?

Chemist: The biggest datasets I use are in my research: about 500 by 100 absorbance values.

The biggest I know of in chemistry would probably be chromatography mass spec data, which can be many thousands of mass spec scans which each have 10 – 100 thousand data values. Another possibility would be 2D NMR, but I am not sure how big those datasets are.

Me: Follow-up question: What is the biggest data students see in your classes?

Chemist: Nada, really

Another Chemist Responds

Me: How big is your data?

Chemist: I think this is related to one reason why there’s more traction for statistics in the curriculum in biology and medicine than in chemistry. I don’t think I’ve ever come across big data as a chemistry student or as a professor in the chemistry that I teach (General and Physical). Certainly we do sometimes have instruments that generate big data files, but that’s just because we have a spectrometer that measures absorbance for each tenth of a nanometer for an interval of 500 nanometers or we have a probe that measures temperature twice per second for an hour. I don’t think these are the sort of data that you have in mind when you say big data.

I think the closest that we get to big data [as chemists] is in biochem and bioinformatics.
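To put the chemist's two instruments on a common scale, the arithmetic can be sketched as follows. This is only a back-of-the-envelope count under the stated sampling rates; the variable names are illustrative, and the spectrometer count ignores the extra endpoint reading.

```python
# Rough sizes for the two instruments the chemist describes.

# Spectrometer: one absorbance reading per 0.1 nm across a 500 nm interval
# (endpoint ignored, so this is "about" the number of readings per scan).
interval_nm = 500
readings_per_nm = 10
spectrometer_readings = interval_nm * readings_per_nm

# Temperature probe: two readings per second for one hour.
readings_per_second = 2
seconds_per_hour = 60 * 60
probe_readings = readings_per_second * seconds_per_hour

print(spectrometer_readings)  # 5000
print(probe_readings)         # 7200
```

Both counts are a few thousand values, which is consistent with the chemist's view that these files are large for a class but not "big data".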

A Physicist Responds

Me: How big is your data?

Physicist: By modern standards, this is not ‘big data’, but nowadays I use oscilloscopes which can return 5 columns, each of 500,000 lines. More typical would be 2-3 columns and 16k lines. Either way, the data sets are too big to ‘cut and paste’, so I’ve actually learned how to read such files into Sage from a computer desktop.

Also, I think you should ask [our astronomer], who uses asteroid databases with up to 500k objects, giving > 5 parameters for each one.
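Reading a file programmatically instead of cutting and pasting might look like the sketch below. It is plain Python rather than the physicist's actual Sage workflow, and it assumes a comma-separated export with a header row; the column names and the in-memory sample are hypothetical stand-ins for a real oscilloscope file.

```python
import csv
import io


def read_columns(handle):
    """Read delimited text with a header row into a dict of float columns."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, value in zip(header, row):
            columns[name].append(float(value))
    return columns


# Hypothetical stand-in for an oscilloscope export. A real file would be
# opened with open("scope.csv", newline="") and passed in the same way.
sample = io.StringIO(
    "time,ch1,ch2\n"
    "0.0,0.10,0.20\n"
    "0.1,0.15,0.25\n"
    "0.2,0.12,0.22\n"
)
data = read_columns(sample)
print(len(data["time"]), max(data["ch1"]))
```

The same function handles 16k or 500,000 lines without any change to the workflow, which is the point of moving away from copy and paste.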

Questions to My Colleagues

1. What is the largest data set that students encounter in your respective majors?

2. In the context of your disciplines, when someone says “big data”, what do you think of? How big? What application areas?

Main Stories:

• There is an enormous difference in scale between classes, student research, and disciplinary biggies.

• Science faculty are only vaguely familiar with really large data sets

• Many faculty never work with anything but very modestly sized data

• Some angst about the approaching big data train

Big is when your workflow breaks

Physicist: I know my ‘work flow changed’ when data sets of > 64k could no longer use (former versions of) Excel as a place to copy & paste and then edit.
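The 64k threshold corresponds to the per-sheet row limit of older Excel versions (65,536 rows in the .xls format, versus 1,048,576 in .xlsx). A minimal sketch of that breaking point, using the physicist's own file sizes and a hypothetical helper name:

```python
LEGACY_EXCEL_MAX_ROWS = 2 ** 16  # 65,536 rows per sheet (Excel 97-2003, .xls)
MODERN_EXCEL_MAX_ROWS = 2 ** 20  # 1,048,576 rows per sheet (Excel 2007+, .xlsx)


def fits_in_sheet(n_rows, max_rows=LEGACY_EXCEL_MAX_ROWS):
    """Return True if a data set with n_rows rows fits on a single sheet."""
    return n_rows <= max_rows


# Typical (16k lines) and large (500,000 lines) oscilloscope files.
for n in (16_000, 500_000):
    print(n, fits_in_sheet(n), fits_in_sheet(n, MODERN_EXCEL_MAX_ROWS))
```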

Peering Around the Bend

Is There a Light at the End of the Tunnel?

Is Big Data Primarily an HR Problem?

Or is it an IT problem?

Harnessing the Deluge

The CVC

A group of liberal arts colleges has formed a Computation and Visualization Consortium (CVC) to address issues of curriculum and faculty development.

• Faculty from Math, Stat, Bio, Chem, CS involved

• 2 months into 4-year plan

• Attempting to identify key skills and ways to teach them

• Faculty development already identified as a key component

Some CVC First Steps

• At St Olaf, Intro Programming is being taught using Python and R and focusing on data-related programming tasks

• At Macalester, all science students will take a 1-hour DCF course

• At Smith, a data science course is being introduced this fall

• At Calvin, an NSF grant is funding redesign of biology laboratories that make more substantial use of chemistry, mathematics and data analysis, and physics classes are using Sage/Python

• Other institutions still in planning phases

• Project MOSAIC is working to make using R easier earlier in the curriculum

• Institutions with cross-departmental effort are progressing most quickly

DCF at Macalester

Data and Computation Fundamentals

• 1-hour course for all science students

• taught in 7 weekly sessions during first year

Example skills

• RStudio and reproducible analysis (RMarkdown)

• Data: curation; import; tidying and cleaning

• Graphical (and numerical) summaries of data

• Split/Apply/Combine

• Database light (join, merge, groupby)

• Modeling/Fitting with functions and smoothers
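As a rough illustration of two of the listed skills, split/apply/combine and a "database light" join, here is a minimal sketch. The course itself uses R and RStudio; this version uses plain Python so that all code examples share one language, and the records, site names, and lookup table are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; not data from the course.
measurements = [
    {"site": "A", "value": 2.0},
    {"site": "B", "value": 5.0},
    {"site": "A", "value": 4.0},
    {"site": "B", "value": 7.0},
]
site_info = {"A": "prairie", "B": "woodland"}  # hypothetical lookup table

# Split: group the values by site.
groups = defaultdict(list)
for row in measurements:
    groups[row["site"]].append(row["value"])

# Apply a summary to each group, then combine the results into one table.
summary = {site: mean(values) for site, values in groups.items()}

# Join: attach the habitat from the lookup table to each summary row.
joined = [
    {"site": site, "habitat": site_info[site], "mean_value": m}
    for site, m in sorted(summary.items())
]
print(joined)
```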

DCF at Macalester – Week 2 Example

Ordway Bird Data

• data on ≈ 7000 birds (weight, time of capture, species, sex, etc.)

• data cleaning required

Example tasks (all answered with plots)

1. How many total birds per month?

2. How does the weight differ by species, wing chord and tail length?

• Make a scatter plot of mean weight by mean wing chord for each species, where color is tail length and diameter is the standard deviation of weight.

• Make a similar scatter plot of the individual birds (leaving off sd). Compare with the previous plot.

3. Any trends over hour of the day?

4. Within species, does mixture of sexes depend on the time of year?

5. How would you identify a migratory species?
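The summaries behind tasks 1 and 2 can be sketched as follows, before any plotting. This is an illustration only: the field names and the five records are invented, not taken from the Ordway data, and the course's own solutions are in R.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

# Hypothetical records; field names and values are invented for illustration.
birds = [
    {"species": "chickadee", "month": 5, "weight": 11.0, "wing_chord": 64.0},
    {"species": "chickadee", "month": 5, "weight": 12.0, "wing_chord": 66.0},
    {"species": "chickadee", "month": 9, "weight": 13.0, "wing_chord": 65.0},
    {"species": "robin", "month": 5, "weight": 76.0, "wing_chord": 128.0},
    {"species": "robin", "month": 9, "weight": 80.0, "wing_chord": 132.0},
]

# Task 1: total birds per month.
per_month = Counter(b["month"] for b in birds)

# Task 2: one point per species for the scatter plot
# (x = mean wing chord, y = mean weight, size = sd of weight).
by_species = defaultdict(list)
for b in birds:
    by_species[b["species"]].append(b)

points = {}
for species, rows in by_species.items():
    weights = [r["weight"] for r in rows]
    chords = [r["wing_chord"] for r in rows]
    points[species] = {
        "mean_weight": mean(weights),
        "mean_wing_chord": mean(chords),
        "sd_weight": stdev(weights),  # sample standard deviation
    }

print(dict(per_month))
print(points)
```

Each entry in `points` would become one marker in the per-species plot; plotting the raw `birds` records instead gives the second, individual-level plot.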

Some Questions

1. What is big data?

• Size? Structure? Hygiene? Workflow?

2. What are the key (big) data skills?

• Programming? Databases? Concepts?

3. Who needs (big) data skills?

• Science Majors? Stat Majors? CS majors? Special subsets?

4. When/how will these students get these skills?

• Special courses/programs? Thread through all courses?

5. Who will take the lead on Big Data Education?

• Statisticians? Computer Scientists? Natural Scientists?

Our ability to statistically analyze data has grown significantly with the maturing of computer hardware and software. However, the evolution of our statistics capabilities has taken place without a corresponding evolution in the curriculum for the undergraduate chemistry major. Most faculty understands the need for a statistical educational component, but there is little consensus as to the exact nature of what is to be taught and who should teach it. Because of the large number of courses required for the undergraduate chemistry major, it seems unlikely that requiring a course on statistics will be practical at most institutions. Additionally, it is unlikely that the typical high school education will address the needed statistics or the software training to prepare students for the chemistry courses. Therefore, the chemistry faculty must teach the statistics needed by the majors. The faculty needs to focus on statistics useful to the chemist and this is distinctly different than what is often encountered in biology, medicine, psychology, and business. A starting point is suggested for a discussion on a statistics curriculum that addresses the needs of the chemistry majors.

Nicholas E. Schlotter, “A Statistics Curriculum for the Undergraduate Chemistry Major,” Journal of Chemical Education 2013, 90 (1), 51-55.

Some Questions

1. What is big data?

• Size? Structure? Hygiene?

2. What are the key (big) data skills?

• Programming? Databases? Concepts?

3. Who needs (big) data skills?

• Science Majors? Stat Majors? Special subsets?

4. When/how will these students get these skills?

• Special courses/programs? Thread through all courses?

• Taught in subject area or by external specialists?

5. Who will take the lead on Big Data Education?

• Statisticians? Computer Scientists? Natural Scientists?

6. Must we walk before we run?

• Can/should we get to the good/big stuff right away?

Thanks

All of this is a work in progress that would not be as far as it is and will not get as far as it can go without the help of others.

• Co-conspirators

Danny Kaplan (Macalester C), Libby Shoop (Macalester C), Nick Horton (Amherst C)

• The Computation and Visualization Consortium

• My science colleagues at Calvin

• The team at RStudio