About a week and a half ago, classes began at Wayland. I have also started my one-day-a-week position at Texas Tech. For those who may not know, I took a bit of a sabbatical from teaching, without knowing at the time that it was a sabbatical. It had been my intention to try my hand at full-time research. A few things motivated my leave from Wayland last May:
- I enjoyed research both as a graduate student and during my two summer post-doc positions, so it seemed natural that I would enjoy more research. I was often frustrated during the school year by constantly having to put my "deep thinking" on hold for class, lecture development, one-on-one student tutorials, exam grading, etc. If I wanted to do "real" research, I was going to need a full-time position to commit the necessary time.
- I had never had a full-time research position before, so I didn't know if it was something I would truly enjoy. Having started my full-time position at Wayland just after completing my Master's, I lacked any experience as a researcher other than, again, part-time research with the constant interruption of my teaching responsibilities.
- The time was now. If I was going to move my career in the direction of research, I needed to try it now. Nobody would hire a Ph.D. who hadn't been doing research for years and who hadn't taken the opportunity to pursue a post-doc position immediately after graduation. I didn't want to look back later, perhaps during a mid-life crisis, and wonder, "What If?"
I took a position as a bioinformaticist at the level of a Post-Doctoral Research Associate in a new lab at Texas Tech University. The lab was set up by Dr. Thea A. Wilkins, who had recently been hired by Tech and who has a pretty significant standing in the world of cotton genetics research. I took the position expecting to be a part of her lab for two to three years. At the end of that time, I would evaluate my career options and determine if research was for me, if I ought to return to teaching, or if I should seek a position with a mix of the two. As is now obvious since I am back at Wayland, I learned a few interesting things about research and about myself during the six months in this position, things that have rekindled my passion for mathematics education.
- Bioinformatics research, particularly data mining and database construction, does not motivate novel research, at least not for me. I honestly believe that using statistical analysis to obtain biological knowledge from the current high-throughput technologies is not an area that will remain a primary research field for long. As I have read more than once, this kind of bioinformaticist will soon be relegated to the position of technician, much as the microscopist has been. It will be a specialized set of skills that fits within the larger context of molecular biology and functional genomics, but not an active field of research in its own right.
- Successful research demands passion, complete with a time commitment beyond my capacity as a family man. I'm sure there are a few select individuals who manage to build successful research programs while maintaining their familial responsibilities, but they are indeed the exception. The stress of publishing, obtaining grants, and finding the next big idea commands a mental focus so complete that one cannot honestly say one is putting family and faith before career. Even someone who can successfully balance this career with family must be strongly motivated in both arenas of life. I am passionately motivated to be a Christian father and a Christian husband, but I cannot say the same of bioinformatics research.
- I am passionate about teaching. If a career in education held the same demands as research, I would actually be motivated to balance that career with my family. Fortunately, while it still requires commitment and an enormous amount of work, undergraduate mathematics education gives me much more freedom to commit time and mental focus to the needs of my wife and the rearing of my children. Sitting in front of a computer screen writing code, building spreadsheets, developing databases, and writing proposals for 10+ hours a day, 5-6 days a week, only made me long for the classroom and for my office hours, where I could teach, respond to questions, and just interact with people. In my first week back at Wayland, I had in-depth conversations with more people than in six months at Tech.
- The work environment in this particular lab at Texas Tech is not conducive to productive research. I'll say no more, other than that I am not the first or even the second person to leave the lab prematurely within the last six months. At least part of the reason for this exodus is the management of the lab. Enough said.
This post was originally intended to discuss my first week at Wayland, but it looks like that will have to be forthcoming. I'll draw this post to a close and conclude that God let me follow a path of self-discovery. I would even go so far as to say that He wanted me at Tech. He also left a place for me at Wayland so I could return.
I still must confess to doubting myself at times, wondering if I am just making excuses for not wanting to work hard. I remember a bad decision I made during the summer following my junior year at Wayland. I was living with my grandparents in Amarillo and completing an English course at Amarillo College. I was trying my hand as a temp and had a job with a firm doing data entry. I was asked if I wanted to take it as a full-time summer position. After my first day, which was tedious and gave me a serious migraine, I convinced myself that I wasn't cut out for data entry and that it just wasn't worth the challenge of monotony it would inevitably provide. Looking back (hindsight is 20/20), I really think I should have kept the job and saved up my earnings for an upcoming wedding, honeymoon, and down payment on a house. I needed to learn the responsibility of a "real" job outside the world of academia. There are things I learned at my first full-time, non-teaching job these past six months that I should have learned a long time ago.
Didn't I say I was drawing this post to a close? Well, I will, saying this: God is at work in our lives, and we can trust His plan to work itself out in us. We simply must give Him the reins to our lives and make Jesus our Lord, master, boss, chair, dean, and all-around head honcho. (Easier said than done.)
We are utilizing a new technology in our lab. The GeXP, among other things, provides a resource for confirming microarray data.
Basically, microarrays measure the level of expression for tens of thousands of genes on a single slide for a given sample. In many cases, a small subset of these data, representing significant differential expression between a sample and a control, will be confirmed through PCR (polymerase chain reaction). The drawback of PCR is that only one gene can be analyzed in a single well on a plate. The GeXP, however, uses gene multiplexing to measure gene expression in a similar way but for more genes per well.
I went through some extensive training on this technology, and we completed our first experiment (after much trial and error). I then completed what I considered a relatively simple analysis comparing the results from the GeXP with the results of the microarray: I simply confirmed that there was a significant linear correlation between the two technologies, using the standard test statistic based on Pearson's correlation coefficient.
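For the curious, that check can be sketched in a few lines of Python. The expression values below are made up for illustration (the original analysis was not necessarily done in Python), but the statistic is the standard one mentioned above.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def t_statistic(r, n):
    """Test statistic for H0: rho = 0, i.e. t = r * sqrt((n-2) / (1-r^2))."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical paired expression measurements for the same genes
microarray = [2.1, 3.4, 1.2, 5.6, 4.3, 2.8]
gexp       = [2.0, 3.1, 1.5, 5.9, 4.0, 2.6]

r = pearson_r(microarray, gexp)
t = t_statistic(r, len(microarray))
```

A large |t| (compared against a t distribution with n-2 degrees of freedom) lets you reject the hypothesis of no linear correlation between the two platforms, which is a good deal more informative than a bar graph.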
I sent a question to my trainer from Beckman Coulter to ask for advice on what type of analysis should actually be done to compare the technologies. I was careful not to tell him what I did because I wanted his independent opinion on the analysis.
Do you want to know what his suggestion was?
"I think you should use bar graphs."
Bar graphs? That's the deep analysis I need. Of course, how could I not have considered drawing pretty pictures as a way of demonstrating the reliability of these technologies? Bar graphs are my dream come true.
I honestly laughed out loud upon learning the depth of thought given to the data analysis by the folks at Beckman Coulter. I'm waiting to hear whether his team (supposedly of technical analysts) might provide deeper insight. I'm still waiting.
I don't mean to sound too critical but if this had been a question on a test in my Elementary Statistics courses, it would definitely not pass muster with me. It just shows a lack of understanding of the importance of statistics in data analysis.
I am being introduced to a side of research that is making me more and more uncomfortable. I know it's probably hard to believe, but I made it all the way through a Ph.D. program without seeing a lick of the hoarding of information that I now see in the field of biology (and bioinformatics).
I've seen a number of blog entries covering "Open Science" on the bioinformatics blogs that I keep up with. However, I've not been reading them in depth, only just skimming them. I had the naive notion that all the people I work with would be wide open with their work, especially within the same lab. Such is not the case.
On one side, I can completely understand. Say I have a colleague who is working on a particular project and puts in a significant amount of work to produce a certain conclusion. There are many steps along the way for which that colleague developed the analysis to reach the conclusion, and this same analysis can be repeated on other work. Whether it is code or simply a methodology/protocol, that work is significant and can make a name for that colleague. Does he share this with me or others in the lab so that it can further my own projects? Or do I give him my data and let him do his analysis? Surely, even if he handed over all his information, I would give him credit. In a perfect world, we all receive credit for the work we do. But what if I don't plan on giving him credit? What if I want to make a name for myself, take his techniques, produce work, and give him no credit at all?
Deep down in my heart, I am not an open-source kind of guy. I am a full-blooded capitalist, through and through. I believe in competition. Of course, there are ethics that provide boundaries for such competition, but I believe that healthy competition can produce quality results. I am not saying that competition and "open science" must be mutually exclusive, as it is possible to race to a result or conclusion with your information fully available to the other party. Ultimately, the ideal goal is the same for all scientists: true understanding of the physical world. Nevertheless, human nature will prevent scientists from a selfless pursuit of truth.
Having entered the realm of full-time research, I need to come to some decisions on just what those boundaries are. There is a line somewhere between 100 percent open science and proprietary, commercial science that must be drawn and defended. So now I'm going back to read those blog posts on open science and to decide just where I stand.
I am still in the phase of my new job where I am inundated with new information every day. Every so often I just get overwhelmed by the number of different ways to do something, whether it's the number of tools available for the job or even the number of different methodologies. I have reached the point where I officially overuse the phrase "we don't have to re-invent the wheel." I am banning it from my vocabulary.
One of the issues I ran into this week was at the other end of the spectrum of confusion. The issue was not too many methodologies but too few. Maybe I am missing something, but the answer to too many of my questions is simply "BLAST". For those reading this without any bioinformatics background, BLAST stands for Basic Local Alignment Search Tool, an algorithm that takes a gene and searches for the best matches in some gene database. It returns a number of statistics that the bioinformatician uses to determine the most likely best match to this gene in a given genome. For example, say I have a gene with an unknown function and wish to identify that function. One preliminary step is to BLAST that gene against databases full of genes with known functions. If I find a significant "hit", or match, I may be able to hypothesize that my gene has the same function as the matching gene. How I come to that conclusion is based largely on how "good" the hit is, whether the hit occurs in the functional domain of the gene, and probably several other factors I have yet to learn.
In one of my projects, the basic approach is to determine the cross-species similarity between two plants at the genetic level. Using expression data for one plant, I want to determine which genes play a similar role in another plant. So, I BLAST. Here's the kicker: for a large percentage of my genes I have multiple good hits in the other species, which indicates similarity such as among gene families. Which genes do I include as having matches? How good a hit is good enough? My current methodology just takes the top hits that have an e-value less than 1e-20 (a fancy way of saying that the probability of that hit occurring among randomly distributed sequences is very low). By this methodology, out of about 50,000 genes, 28,000 of them have hits in the other plant. But I also leave out 19,000 genes from the other plant that would have hits with an e-value of less than .00001, which still indicates a very good hit. Is it possible that these are also genes with the same function? Of course.
So here's the conclusion I reached this week. Say I have a large database of genes and I want to identify genes with a certain property, characteristic, or function. It is highly unlikely that I will find ALL the genes meeting my criteria. Instead, I set a threshold so that when I search with these criteria, I can be relatively certain that all the genes I obtain meet them, even though I may be leaving out a large number of satisfactory genes. I want to make sure that the genes I select are satisfactory, but I draw no conclusions about the genes I leave out. That's where I draw the line.
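The thresholding rule above can be sketched in Python. The parsed-hit structure and gene names here are hypothetical (real BLAST output would be parsed from its tabular report), but the logic is the one described: keep the best hit per gene below a strict cutoff, and say nothing about the rest.

```python
STRICT_CUTOFF = 1e-20  # the cutoff used in the project described above

def confident_matches(hits, cutoff=STRICT_CUTOFF):
    """Return the best hit per query gene whose e-value passes the cutoff.

    `hits` is a list of (query_gene, subject_gene, evalue) tuples.
    Genes whose hits miss the cutoff are simply absent from the result;
    no conclusion is drawn about them either way."""
    best = {}
    for query, subject, evalue in hits:
        if evalue >= cutoff:
            continue  # misses the threshold: left out, not ruled out
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best

# Hypothetical hits for illustration
hits = [
    ("cottonA", "arab1", 1e-45),
    ("cottonA", "arab2", 1e-22),  # also passes, but arab1 is the better hit
    ("cottonB", "arab3", 1e-8),   # a good hit, but it misses the strict cutoff
]
matches = confident_matches(hits)
```

Loosening `cutoff` trades confidence for coverage, which is exactly the trade-off between the 28,000 genes kept at 1e-20 and the 19,000 more that a .00001 cutoff would admit.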
I finally managed to get a glimpse of the big picture when it comes to fiber research. It may seem fairly obvious now, but for some reason I was so immersed in the A's, G's, T's, and C's of the genetic code, and in hunting for their patterns, that I forgot to ask the big question: Why?
Answer: Better cotton and more of it.
In a talk, one of my colleagues laid out a bit of this for me by describing two of the most-harvested cotton species. The first is Gossypium hirsutum, sometimes called Texas Marker-1 (TM-1) or Upland. Its name comes from "hirsute", meaning "hairy". This species has a high yield, a definite advantage, but as the name suggests its fibers are "hairy", not as long and strong as those of other species. Another variety is Gossypium barbadense, or Pima, commonly known as "Egyptian Cotton". Its fibers are longer and stronger, resulting in a higher-quality fabric. However, this variety yields substantially less.
Now, in comes research in the study of the cotton genome. The goal is to understand the biological mechanisms and the underlying genetic code that produces the differences in the varieties of cotton. If we can identify significantly differentially expressed genes in varieties of cotton at different stages of development, and use this information to discover active biological pathways, we may be on our way to understanding the system of biological development in cotton. Then, knowing that, we will work to produce a cotton plant with the yield of Upland and the quality of Pima.
What I learned this week #1:
Another week has transpired at my new job as a researcher at TTU, and I am being inundated with all sorts of new information. For one thing, I have folks milling around behind my workstation doing all sorts of laboratory things: freezing things in liquid nitrogen (or something else very cold that billows smoke), pipetting (if that is even a word), etc. I'm just disappointed that they're not wearing lab coats with a mad-scientist look in their eyes. Unfortunately, my job is much less exotic-looking. I sit in front of a computer, all day long. I do have a pretty fancy setup with two 19-in LCD monitors plugged into a pretty hefty computer (two dual-core processors, 4 GB of memory).
I have a number of different projects, but they all seem to start in the same place, and due to my lack of experience with the biology, I don't have a good feel for how to follow these initial steps. In essence, the geneticist I work for has a great deal of data on the expression of genes in cotton across various varieties and various developmental phases. So, I start with a list of genes that have been identified as having a particular function in another plant, the most common being Arabidopsis, since its genome has been entirely sequenced. We then identify whether these genes are present in cotton. Once we have this list of genes, we examine their developmental expression and draw conclusions about their role in cotton.
You know, when I state it like that it seems very simple, but each of the above steps hides several smaller steps that can lead to a great deal of work. So far, I have very little to say about the conclusions we draw. All I have really done is the first phase: identifying these genes in Arabidopsis and beginning to compile the list of these genes in cotton. Next week, I'll begin collecting the expression data for some of these genes of particular interest.
I should mention that the first project is actually slightly modified, in that we first looked at genes with specific roles identified in cotton and then determined whether these same genes played similar roles in other species. If any computational biologists, functional geneticists (is that a term?), or bioinformaticists read this and it seems naive, please be kind and realize that I don't speak the language very well yet. I'm absorbing as much as I can as fast as I can. Having a background in applied mathematics and numerical analysis helps, but I still feel handicapped.
- I learned how to migrate an MS SQL database website to a new server.
- I learned how to update data in a MySQL database with a pre-built utility website (first time using the "DELETE" and "SELECT" commands with a "LIKE" modifier)
- I installed PHP and MySQL to run on Microsoft IIS, followed by installing ActiveCollab for project management
- I used query design mode extensively in Microsoft Access, eventually resorting to SQL statements for "UNION" queries
- I updated the BLAST database used by NCBI wwwblast on a local utilities site
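The LIKE and UNION queries in the list above can be sketched like this, using Python's built-in sqlite3 as a stand-in for MySQL/Access (the table and gene names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE cotton_genes (name TEXT)")
cur.execute("CREATE TABLE arab_genes (name TEXT)")
cur.executemany("INSERT INTO cotton_genes VALUES (?)",
                [("GhCESA1",), ("GhEXP2",)])
cur.executemany("INSERT INTO arab_genes VALUES (?)",
                [("AtCESA1",), ("GhCESA1",)])

# SELECT with a LIKE pattern: every gene name containing "CESA"
cesa = [row[0] for row in
        cur.execute("SELECT name FROM cotton_genes WHERE name LIKE '%CESA%'")]

# UNION: distinct names across both tables (duplicates collapsed)
all_names = [row[0] for row in
             cur.execute("SELECT name FROM cotton_genes "
                         "UNION SELECT name FROM arab_genes ORDER BY name")]
conn.close()
```

Note that UNION (as opposed to UNION ALL) removes duplicate rows, which is handy when the same gene name shows up in more than one table.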
One of the broadest definitions of bioinformatics that I have come across was in Sorin Draghici's book, Data Analysis Tools for DNA MicroArrays.
Def: Bioinformatics is the science of refining biological information into biological knowledge using computers.
Under the heading of bioinformatics falls a wide variety of fields of study and problems. Some of the primary issues addressed, historically, have been sequence analysis, protein structure prediction, and the dynamic modeling of complex biosystems. Other areas of fairly recent research have been protein-protein interactions, protein-DNA interactions, enzymatic and biochemical pathways, population-scale sequence data, large-scale gene expression data, and ecological and environmental data.
That's what Draghici has to say about it, anyway. As a newbie, I am constantly discovering new avenues of research and trying to assimilate and categorize all the new information I come across. So far, it looks like my research will begin in the areas of sequence analysis, modeling of biosystems, pathway analysis, and analysis of gene expression data.
The first project in which I am involved is simply a cross-species comparison of genes that have been identified in one species, through expression analysis, as having a role in cell wall development. We'll use this information to predict their role in other plants as well.
I'm already having quite a good time at my new job taking care of some techie stuff, more so than I ever had a chance to do in my last profession. I've tweaked a website design to meet the boss's requirements, migrated a database website from one server to another (learning a good deal about MS SQL Server in the process), and identified and solved a particular issue with the GeneSpring workgroup server.
I'm also utilizing a new software tool on my own server, something called activeCollab, a project management utility. It is an online database of my current projects and activities. Through a fairly easy-to-use web interface, I am able to enter all my projects along with their tasks, messages, milestones, etc. I've given access to my colleagues and my PI to allow them to keep tabs on my progress. Plus, it helps me make sure that I am staying on task and meeting all my own goals.
I've taken a post-doctoral research position in Bioinformatics at Texas Tech University. I am going into full-time research for a while. I have decided that the next step in my career is to pursue a larger commitment to research. That doesn't mean I am completely done with education. I have developed an online College Algebra course for Wayland and will teach as an adjunct for its Virtual Campus. I hope to continue this indefinitely but we'll just have to see how well such a course will work.
I hope to use this blog to document my progress in learning the field of bioinformatics and the issues I have had to address in the transition. Currently, I am learning Perl for the first time. Over the past summer, I did some work for my future boss, and in most of it I utilized C# with its strong set of regular-expression tools. Since so many people are still using Perl for bioinformatics, I feel it is necessary to get a good handle on it. So far, it's mostly syntax that I have to learn, since most of the primary programming-language constructs I am used to are all there. I've done text processing with C#, so that aspect is not entirely new.
Next on the horizon is honing my skills in MySQL and Microsoft SQL Server. Down the road, I'll also have to be introduced to Oracle, since we are using a software package from Agilent that relies on it. I was asked the other day whether I know R, the open-source statistical package. I have never used it, but I did a quick survey of what it can do and realized that a lot, if not most, of what you can use R for, I can already do in MATLAB. So I'll be learning it as well, along with the Bioconductor package, which provides specific tools in R for computational biology.
I'll also use this blog to post interesting software tools that I come across and use on a regular basis.