Mining the Ngram Viewer

Google Books Ngram Viewer is a nifty tool that analyzes all the text of all the books Google has digitized (over 25 million and counting) and lets you see the relative frequency of words going back to the 1600s.

What isn’t immediately obvious to most people is what you can do with Ngram Viewer — what kinds of insights you can glean from analyzing the text within books. I don’t have an easy answer, but here are a few ways to search Ngram Viewer. Leave a comment and let me know what you’ve been able to do with this intriguing research tool.

Compare the relative popularity of concepts over time. For example, you can compare the frequency of the words progress, tradition and innovation over the decades. To make this more intriguing, note the different results when you limit your search to British English or American English.

Search for any word that appears near a specific word. If, for example, you were researching the nursing profession, you might want to see what words most commonly precede the noun “nurse”. Using the syntax *_NOUN nurse, you can note the spike and subsequent drop in frequency of “head nurse” in the middle of the 1900s.

Compare the prevalence of a concept in fiction against all English-language text. If you want to see how frequently doctors and nurses are mentioned in fiction, the query nurse:eng_2012,nurse:eng_fiction_2012,doctor:eng_2012,doctor:eng_fiction_2012 will show you that doctors get the most press.
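The corpus-comparison query above follows a regular pattern (every term tagged with every corpus), so it can be generated rather than typed by hand. Here is a minimal Python sketch. The helper names are hypothetical, and the graph URL and its parameter names (content, year_start, year_end, smoothing) are assumptions about how the viewer is addressed, so check them against the live site before relying on them.

```python
from itertools import product
from urllib.parse import urlencode

# Assumed address of the Ngram Viewer graph page; verify before relying on it.
NGRAM_GRAPH_URL = "https://books.google.com/ngrams/graph"


def corpus_query(terms, corpora):
    """Tag every term with every corpus, e.g. nurse:eng_2012."""
    return ",".join(f"{term}:{corpus}" for term, corpus in product(terms, corpora))


def ngram_url(query, year_start=1800, year_end=2008, smoothing=3):
    """Build a viewer URL. The parameter names are assumptions, not documented API."""
    params = {
        "content": query,
        "year_start": year_start,
        "year_end": year_end,
        "smoothing": smoothing,
    }
    return f"{NGRAM_GRAPH_URL}?{urlencode(params)}"


query = corpus_query(["nurse", "doctor"], ["eng_2012", "eng_fiction_2012"])
print(query)
print(ngram_url(query))
```

The first line printed is the same query string quoted above; adding a third term or corpus to either list extends the comparison without any retyping.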

Granted, the syntax requires you to channel your inner programming nerd, and it takes some creativity to figure out how to use the Ngram Viewer. If you want to dig even deeper into all of its capabilities, check out the advanced search page.

Is Google feeding confirmation bias?

In its ongoing effort to answer the world’s questions (and sell ads), Google has been putting increased emphasis on its “featured snippets” – the little boxes of text extracted from whatever source Google has calculated to be most relevant. If I want to see whether my dogs can catch the flu, I can quickly see that, yes, it’s possible.

However, a recent Wall Street Journal article (“Google Has Picked an Answer for You—Too Bad It’s Often Wrong”) looked at the increased frequency of these quick answers that appear at the top of search results. (Note that these are not the Knowledge Panels, which are sourced from Wikipedia and other neutral sources.)

According to a study commissioned by the WSJ, these featured snippets are often excerpted from unreliable or biased sources. Google’s algorithm favors a web site with text that most exactly matches the query; as a result, the researchers found that the extracted text was more likely to come from a less-authoritative, biased or dodgy clickbait site.

Worse yet, since featured snippets are designed to closely match the query, they can feed confirmation bias. The featured snippet for the query “is milk good for you” says “Milk can be good for the bones because it provides vitamin D and calcium…” The featured snippet for the search “is milk bad for you” says “Animal milk has long been claimed as the go-to source of calcium by the dairy industry, but as it turns out, milk is bad for you.”

Since most people have been trained by Google to trust the first answer that appears, it’s even more important to practice some information hygiene before relying on the first answer from a search engine.

Do you trust that news?

In its ongoing efforts to address the scourge of misleading and false news, Google recently announced a new feature that helps readers evaluate a news source they may not be familiar with. Now, when you search for a particular publication, the Knowledge Panel – that preformatted answers box that often appears at the top of search results – includes information about that publisher.

Depending on the publication, that can include awards they have won, the topics they cover most extensively and their political alignment. If content from the publication has recently been reviewed by an authoritative fact-checker, those items are also featured in the Knowledge Panel. [UPDATED: This seems to work in Google Chrome and Safari, but not Firefox. Thanks, Pam Wren, for the heads up!]

So, for example, if you Google “Wall Street Journal”, your search results page will include a Knowledge Panel like this:

[Image: Knowledge Panel for the Wall Street Journal]

You’ll see a one-sentence blurb from the Wikipedia article about the newspaper, links to professional awards for reporting, and a summary of the topics they have recently covered — in the case of the Wall Street Journal, that’s the Federal Reserve, advertising, sales and taxes… about right for a newspaper described as business-focused.

And if you Google “Breitbart”, your search results page will include a Knowledge Panel like this:

[Image: Knowledge Panel for Breitbart]

If you click the link for “Writes About”, you’ll see that Breitbart has recently covered Donald Trump, Barack Obama, the Republican Party and Hillary Clinton… what you might expect from what the Wikipedia article describes as a “far-right American news, opinion and commentary website”. But note the “Reviewed Claims” tab, highlighting reported facts that were then determined to be false by fact-checkers like Snopes, Politifact and FactCheck. This stands out as a concern — most news sources’ Knowledge Panels don’t include lists of reported facts that were questioned and reviewed by fact-checking sites.

This is a great way for librarians and information professionals to instill a little FUD (Fear, Uncertainty and Doubt) when their clients assume that whatever they see on their Facebook feed is reliable. And check out Vanessa Otero’s infographic, What, Exactly, Are We Reading?, a nice chart of where various media sources fall, both in terms of reliability/fabrication and liberal/conservative.

For Google, it’s location, location, location

I know… all Google is trying to do is help you get “better”, or at least more relevant, results from a search. And Google has assumed that you are your location — that where you are searching from really matters. Much of the time, that’s great. But for us professional searchers who search outside our own country, Google has just made a change that will significantly affect our search strategies.

Until now, if you wanted to focus your search on results from the UK and you were located in the US, you would go to the UK version of Google at google.co.uk. And yes, you’d always get different results than when you ran the identical search in google.com. However, according to a recent Google blog post, this trick will no longer work.

Now the choice of country service will no longer be indicated by domain. Instead, by default, you’ll be served the country service that corresponds to your location. So if you live in Australia, you’ll automatically receive the country service for Australia, but when you travel to New Zealand, your results will switch automatically to the country service for New Zealand. Upon return to Australia, you will seamlessly revert back to the Australian country service.

There’s a workaround; go to Settings and select Advanced Search. Scroll down to “Then narrow your results by…”, pull down the Region menu, and select the country you want to use to focus your search.

[Images: step 1 and step 2 of the workaround]

I tried this out with a search for Brexit, first searching in google.co.uk, then in google.co.uk with the Region set as United Kingdom. And, just curious to see if it would make a difference, I tried a third search in google.co.uk after setting my VPN to connect in the UK. And I got different search results from all three searches. Below are the top search results, highlighting the results that only showed up at the top of one of the three searches. Note that each search turned up results that weren’t in the top of the other two.

[Image: comparison of the top results from the three searches]

Bottom line: I’ll now be doing three searches when I’m using Google to find information from a specific country or region. Thanks, Google…

 

Super searching tips

I just got back from Internet Librarian 2017 (in beautiful Monterey, CA — tough assignment). Among the insights I’ve brought back are:

Google Image search is focused more on matching meaning than matching images. If you want to search for instances of an image (to watch for usage of your organization’s images or to find mentions of a chart or graph in a report or article, say), you’re better off using a reverse-image search tool like TinEye instead.

A use of reverse image search I don’t often remember is to see if you’re looking at a legitimate profile in social media or a fake. Right-click the person’s image, copy the URL and search for other instances of that image. If it’s a fake profile, it’s likely that whoever set up the profile used an image that appears elsewhere on the web, often a stock photo.

Remember Google’s undocumented (i.e., not in Google Help) prefix searches. You can use intext: to look for words in the body of the page, intitle: for words in the title, inurl: for words that appear in the URL itself, and inanchor: for the words that appear in the anchor text (the text that’s highlighted in a hyperlink). Remember that you can’t have a space between the prefix and your search term; for example, use intitle:asteroid to find web pages that have the word asteroid in the title.

And I just learned about a new top-level domain – .graphics, so you can look for web pages specifically pertaining to computer and data graphics by searching for site:*.graphics.

When researching a topic, consider whether you want to search by process (how do I do this activity/thing?) or outcome (how can I get this result?). You’ll use different words and find different results based on which perspective you take.
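The no-space rule for the prefix searches above is easy to get wrong when typing quickly, so here is a small Python sketch that assembles prefixed terms for you. The function names are hypothetical, and quoting multi-word terms so the prefix covers the whole phrase is my assumption about how Google reads them, not something taken from Google’s documentation.

```python
# Prefixes mentioned in the tips above, plus site: for domain searches.
ALLOWED_PREFIXES = {"intext", "intitle", "inurl", "inanchor", "site"}


def prefixed(prefix, term):
    """Return prefix:term with no space after the colon."""
    if prefix not in ALLOWED_PREFIXES:
        raise ValueError(f"unknown prefix: {prefix}")
    term = term.strip()
    if not term:
        raise ValueError("empty search term")
    # Assumption: quoting a phrase makes the prefix apply to all of its words.
    if " " in term:
        term = f'"{term}"'
    return f"{prefix}:{term}"


def build_query(*parts):
    """Join plain words and prefixed terms into one query string."""
    return " ".join(parts)


print(prefixed("intitle", "asteroid"))
print(build_query("infographic", prefixed("site", "*.graphics")))
```

The resulting strings can be pasted straight into the search box; the check on the prefix name simply guards against typos such as "intitel".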

You can also see my slide decks from IL2017 — Super Searcher Strategies and (ROI) Truth to Power.

Using Google Trends for CI

On Nov. 25, the Wall Street Journal had an article about finding the best “door-buster” items for Black Friday and Thanksgiving weekend sales. A graphic accompanying the article caught my eye – it showed dramatic spikes in Google search activity for a particular brand of women’s boots every year at the end of November… just around Black Friday.

This graph was generated by Google Trends and, while it wasn’t the focus of the article, it got me thinking about the usefulness of Google Trends in identifying marketing opportunities. Imagine what you would learn if you searched for your key products or services, or those of your competitors. If you learned that your customers were looking for information about a competing product during a predictable time period, wouldn’t you want to time your communications to be talking with your market right then?

 

More super-searcher tips

It’s the beginning of conference season for us public speakers… along with the daffodils appear boarding passes and PowerPoint slides. One of my favorite conferences is Computers in Libraries, and I will be leading the Searcher Academy pre-conference workshop as well as giving a regular presentation on super searcher tips.

I have more tips than I could fit into a blog post; here are a few of my favorites that I will be sharing at Computers in Libraries:

* All of us consider ourselves to be above-average Google searchers. However, there are times when you can be too clever for Google and wind up with unexpected results. Say your search logic is (A AND B) OR (C AND D), for example (Australia AND snakes) OR (Colorado AND mountain lions) if you were comparing the dangerous animals of two regions. However, this search gets translated into logical gibberish by Google: Australia AND (snakes OR Colorado) AND mountain lions. You will get better results by separating your query into two different searches.

* How you word your search matters – a lot! I was looking for information on Uber’s market strategy and found dramatically different results with the following three seemingly similar queries: Uber market strategy, What is Uber’s market strategy, and Uber “market strategy”. Always try several versions of your query, as there is surprisingly little overlap among the results of similar searches.

* Use MillionShort.com if you are researching an obscure topic, an individual, or looking for any kind of long-tail resource. This search engine lets you eliminate from your search result any of the million most popular web sites. You can also filter out any sites that have advertising or that appear to be e-commerce sites, which can be an effective way to find the web site for a small non-profit or a group committed to a cause.
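To see why the two readings of the nested search in the first tip are not the same thing, you can enumerate every combination of the four terms being present or absent on a page. This Python sketch only compares the logic of the two expressions as written in the tip; it makes no claim about what Google does internally, and the function names are hypothetical.

```python
from itertools import product


def intended(a, b, c, d):
    """(A AND B) OR (C AND D): what the searcher meant."""
    return (a and b) or (c and d)


def as_interpreted(a, b, c, d):
    """A AND (B OR C) AND D: the reading described in the tip."""
    return a and (b or c) and d


# Every combination of the four terms appearing (True) or not (False) on a page.
mismatches = [
    combo
    for combo in product([False, True], repeat=4)
    if intended(*combo) != as_interpreted(*combo)
]
print(f"{len(mismatches)} of 16 combinations disagree")

# A page mentioning only Australia and snakes satisfies the intended logic
# but not the interpreted one.
print(intended(True, True, False, False), as_interpreted(True, True, False, False))

# The suggested fix: run each AND group as its own search.
groups = [["Australia", "snakes"], ["Colorado", '"mountain lions"']]
queries = [" ".join(group) for group in groups]
print(queries)
```

Every disagreement is a page that the intended search should have returned and the interpreted one drops, which is why running the two groups as separate searches is the safer route.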

You can see some of my prior super-searcher tips here and here.

Super-searcher secrets

I am a business researcher and analyst by training, and I’m always on the lookout for new and creative ways to find the more “hidden” information about companies and individuals. Here are a few of my latest favorites.

Patent Insights

A recent webinar hosted by the Association of Independent Information Professionals gave me some great ideas on how to use patent information to glean business intelligence. Using ProQuest Dialog, presenter Darla Agard talked about various ways to find the unexpected and identify new opportunities using techniques I hadn’t thought of using. While her examples were from PQD, much of the data mining can be done in other professional online services as well. I particularly appreciated that most of these tips incurred no expenses.

Say you want to figure out where a company’s R&D centers are located. This is usually not something a company will disclose in its annual report or on its web site, but if the company holds patents, a smart researcher can deduce the location of the centers by using PQD’s search filtering tools. As Darla explained it, use the INPADOC database to find all the patents held by the target company. In the search results page, scroll down along the right margin to “Narrow your search by” and expand the tab for “Patent assignee country”. This list shows the countries in which most of the patent assignees are located; this may indicate in which countries the company most likely has an R&D center. Likewise, the tab for “Patent publication country” is likely to indicate the manufacturing and distribution centers.

M&A Activity

To get a rough idea of mergers or acquisitions a company has been involved in, search a business database such as Gale Group Trade & Industry for the company name, then limit the results to articles with the subject Company Acquisition/Merger. Scan the titles of the resulting set to see what companies are mentioned with the target company. It isn’t perfect, but it gives you a nice snapshot of recent activity.

Identifying Competitors

To identify competitors in a small market, I sometimes use LinkedIn. I pull up the company page for any player I know of in the field. In that company’s LinkedIn page, I look at the box along the right margin for “People Also Viewed”, a list of the other companies that people looked at while they were looking at this company. While it’s not a comprehensive search, this gives you a start on identifying some of the other companies that share consumer mind-space.

Try Facebook Search Again

Facebook recently announced it was making all publicly-viewable posts searchable through what it calls “universal search”. It’s not quite universal; it doesn’t include any posts not marked as public, and only English-language posts are included in the archive. You can’t limit your search to phrases; Boolean operators like AND and OR don’t work; and you can’t filter or limit your search by date, nor can you sort the results by date. That said, if you are looking for topics that are narrowly focused or a small company, you may find useful pointers through Facebook’s newly-expanded search.

When is a headline not a headline?

I’m one of those people who still reads newspapers. Even worse, I still get the print newspaper delivered to my doorstep every day. I could wax eloquent about the tactile pleasure and serendipitous delight of paging through a print newspaper, but I’ll spare you.

Often, I find an article thought-provoking enough that I want to share it and my thoughts with the world. Easy – I pop online, find the digital version of the article, and no one needs to know I saw the article first on a dead tree.

However, I ran into problems recently when I looked up the online version of a Wall Street Journal article and couldn’t retrieve it by searching for the title. Eventually, I found it; my problem had been that the digital version had an entirely different title than the print. Curious, I compared a week’s worth of print and digital headlines of WSJ articles and found that fewer than a quarter of the headlines were the same. While some of the titles were similar (“Boeing Scrambles to Get Key Part” and “Boeing, Supplier Wrestle to Produce Key Component”), others were entirely different (“Bye, Boss, Let’s Stay Friends Forever” and “How to Leave Your Job Gracefully”).

When I asked Dow Jones about the discrepancy between print and digital headlines, I got the following not-entirely-satisfying response:

At the Journal we are constantly refining our approach to headlines to ensure that our readers are automatically drawn to our work, whether printed on a page or comprised of pixels.  We often fine-tune headlines in order to reflect developing news and improve SEO, but at the end of the day we are always looking for the best blend of digital optimization and smart journalism.

HERE is a table of the headlines from my one-week sample. Lesson: to find the digital equivalent of a print article quickly, search for a few unusual or distinctive words in the text rather than for words in the headline.

 

Two super-searcher tricks

I’ve been an online searcher since the 19-mumbles, and I’m still learning new search tricks. Here are a couple of tips for mining online databases that I picked up from Cynthia Hetherington, a Big Kahuna in the private investigative world, during an excellent webinar on due diligence she gave for AIIP.

When you are exploring a new resource for information on individuals and want to figure out how far back in time the dataset goes, try searching for a common name like Smith. Since it’s a safe assumption that there will be Smiths in even the earliest records, you can just sort the search results in chronological order from earliest forward, and you’ll probably see the first year of coverage. You could use the same approach with any other type of database — just search for something that is likely to occur very frequently, and then see the date of the earliest record you retrieve. Searching an export database? Try a common export like machinery. Checking out a database of news articles? Search for the word President.
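If the resource lets you export results but not sort them, the same check takes a few lines of Python. This is only a sketch: the records below are made up for illustration, and the field name "filed" is a hypothetical stand-in for whatever date field your export actually contains.

```python
from datetime import date

# Made-up results for a search on a common name such as Smith.
results = [
    {"name": "Smith, J.", "filed": date(1994, 3, 2)},
    {"name": "Smith, A.", "filed": date(1987, 11, 15)},
    {"name": "Smith, R.", "filed": None},  # some records may lack a date
    {"name": "Smith, L.", "filed": date(2003, 6, 30)},
]


def earliest_year(records, date_field="filed"):
    """Return the year of the oldest dated record, or None if none are dated."""
    dates = [r[date_field] for r in records if r.get(date_field) is not None]
    return min(dates).year if dates else None


print(earliest_year(results))
```

The year it prints is only a probable start of coverage: a rare term would tell you much less, which is why the tip relies on something as common as Smith.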

Another trick I learned from Cynthia relates to those times when you’re looking for reliable information on a topic and keep turning up too much irrelevant material. Try restricting your search to only government sites by adding to your search the phrase site:gov. Sure, it’s a very restrictive search and probably won’t turn up a lot of results, but the sites you do get will probably be useful. Cynthia recommended using this technique when looking for public records on individuals and needing to weed out all the resellers of government data.

(I was surprised at how useful this was. When googling bull snakes, having found one living in my backyard, I wasn’t finding much reliable information. Even the Wikipedia entry was full of “citation needed” notes. Limiting my search to .gov or .edu sites, I turned up several useful articles from university extension services and state government web sites. Bull snakes are our friends!)
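If you use this restriction often, a tiny helper can produce one query per domain suffix, which keeps each search simple rather than combining the suffixes with OR. A minimal Python sketch; the function name is hypothetical.

```python
def site_restricted_queries(query, suffixes):
    """Return one query per suffix, each with a site: filter appended."""
    return [f"{query} site:{suffix.lstrip('.')}" for suffix in suffixes]


for q in site_restricted_queries("bull snakes", [".gov", ".edu"]):
    print(q)
```

Running the two searches separately also makes it easy to see which kind of source each useful result came from.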

[ADDED: Tara “ResearchBuzz” Calishain reminded me that she built an awesome Google Custom Search Engine that limits the search to just US states, counties and cities. See her description here.]

What search tricks have you learned recently?