Megan Taylor

Questions About APIs and Dirty Data and Best Practices

So I’m working on this Farmers Market Locator project, and I’ve got a pretty basic version up and running. Everything is client-side. And I’m using the New York State Open Data API to get the information on the farmers markets. Right now, all that happens when you use the site is a query to Google to find out where you are, and then a query to the NYSOD API to get the nearest markets’ info, and then some Google Maps API stuff. But some of the data is a little dirty: missing spaces in addresses, that kind of thing.
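
To make that concrete, here’s roughly what the flow looks like. This is just a sketch, not my actual code, and the latitude/longitude field names are a guess at the NYSOD schema (the location could come from the Google Geolocation API or the browser’s navigator.geolocation; the sketch uses the latter):

```javascript
// Rough sketch of the current client-side flow, not the actual project code.
// The `latitude` / `longitude` field names are an assumption about the
// qq4h-8p86 dataset -- check the real column names before relying on them.

function distanceKm(lat1, lon1, lat2, lon2) {
  // Haversine distance between two lat/lon points, in kilometers.
  const toRad = (d) => (d * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(a));
}

navigator.geolocation.getCurrentPosition(async (pos) => {
  const { latitude, longitude } = pos.coords;

  // Pull the market list from the NYSOD (Socrata) endpoint.
  const res = await fetch("https://data.ny.gov/resource/qq4h-8p86.json?$limit=1000");
  const markets = await res.json();

  // Sort by distance from the user and keep the closest few.
  const nearest = markets
    .filter((m) => m.latitude && m.longitude)
    .sort(
      (a, b) =>
        distanceKm(latitude, longitude, +a.latitude, +a.longitude) -
        distanceKm(latitude, longitude, +b.latitude, +b.longitude)
    )
    .slice(0, 5);

  console.log(nearest); // hand these off to the Google Maps API for display
});
```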

More complications: There are a bunch of features I want to add, some of which involve incorporating data from the USDA Farmers Market API. Now the USDA API has a little more info about the markets, like what kind of products are sold at each market. This data might also be pretty dirty. And as far as I can tell, the only way to match up markets between the two APIs is to do string matching. (Meaning that, having determined that the Union Square Farmers Market is the closest to your location, I then have to search the USDA API for “Union Square Farmers Market”.)
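
One way to make that string matching a little less fragile is to compare normalized names instead of raw strings. This is just an illustration: the normalize() rules and the results[].marketname shape of the USDA response are assumptions, not anything either API guarantees.

```javascript
// Sketch of matching on normalized names instead of raw strings.
// normalize() is an illustration, and the results[].marketname shape of the
// USDA response is an assumption -- verify it against the real API first.

function normalize(name) {
  return name
    .toLowerCase()
    .replace(/['".,&()-]/g, " ")                        // drop punctuation that varies between sources
    .replace(/\b(farmers?|market|greenmarket)\b/g, " ") // drop generic words
    .replace(/\s+/g, " ")                               // collapse runs of whitespace
    .trim();
}

// Given a NYSOD market name and the parsed USDA zipSearch response,
// return the first USDA entry whose normalized name matches.
function findUsdaMatch(nysodName, usdaResults) {
  const target = normalize(nysodName);
  return usdaResults.find((r) => normalize(r.marketname).includes(target));
}
```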

Is there a better way to approach matching up the markets between the two different APIs? Do I need to switch to a back-end solution? What’s the best way to clean up the dirty data? Should I be pulling this info into something else, like a Google Spreadsheet, cleaning it up there, and making queries to the spreadsheet instead of NYSOD?

I don’t even know how to approach this.

Edit: Adding links to raw JSON.

http://data.ny.gov/resource/qq4h-8p86.json

http://search.ams.usda.gov/farmersmarkets/v1/data.svc/zipSearch?zip=10008

Edit: Some suggestions have been made…but as usual they only spawn new questions.

  • Best way to match markets:
    • search name (problem: not standard)
    • search address (problem: not standard)
    • use URLs as keys (problem: what URLs??!!)
  • Cleaning up dirty data:
    • You can use a back-end solution with some data store OR you could do it all in JS. If you were to do it all in JS you would just have to call a few APIs, compare the data returned, and fill in missing data from one API with data from the other. You would also need to decide which would have priority if there was a conflict (see the merge sketch after this list). (problem: wouldn’t doing all that matching, comparison, and cleaning on the client-side make it slower?)
    • I think you are wise to consider grabbing the data, cleaning it up and storing it on your own backend. This allows you to keep your app up and running even if they change their format. (Sure, your data might get stale, but probably not by much, and it would buy you time to address any new formatting from your sources.) You could pull down the source data, create your combined JSON set, and just use that, depending on how huge all the data is (see the snapshot sketch after this list). While a backend solution might sound really complicated, it doesn’t have to be. A Google spreadsheet could work, or Google Fusion Tables, or Parse.com, Firebase.com, or Tableau.
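
For the all-in-JS route, the merge itself could look something like this sketch (field names are placeholders, not the real schemas), with NYSOD taking priority and USDA filling in the blanks:

```javascript
// Sketch of merging one NYSOD record with its matched USDA record in the
// browser. NYSOD values win on conflicts; blanks get filled from USDA.
// Field names are placeholders, not the real API schemas.

function mergeMarket(nysod, usda) {
  const merged = { ...usda, ...nysod }; // NYSOD wins when both sides have a value
  for (const [key, value] of Object.entries(usda)) {
    // Fill in anything NYSOD left null or empty.
    if (merged[key] == null || merged[key] === "") merged[key] = value;
  }
  return merged;
}
```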
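
And for the snapshot route, here’s a sketch of a small Node 18+ script that pulls both sources once and writes a combined file for the client to load, assuming a zip field on the NYSOD records:

```javascript
// Sketch of the "snapshot" option: a small Node 18+ script (fetch is built in)
// that pulls both sources once and writes a single combined file the client
// can load instead of hitting either API live. The `zip` field name is an
// assumption about the NYSOD schema.

import { writeFile } from "node:fs/promises";

const NYSOD_URL = "https://data.ny.gov/resource/qq4h-8p86.json?$limit=1000";
const USDA_URL = "https://search.ams.usda.gov/farmersmarkets/v1/data.svc/zipSearch?zip=";

async function build() {
  const nysod = await (await fetch(NYSOD_URL)).json();

  // One USDA lookup per distinct zip code, so the API isn't hammered.
  const zips = [...new Set(nysod.map((m) => m.zip).filter(Boolean))];
  const usdaByZip = {};
  for (const zip of zips) {
    usdaByZip[zip] = await (await fetch(USDA_URL + zip)).json();
  }

  // Store both raw sets together; the matching/merging step can then run
  // against this snapshot instead of the live APIs.
  await writeFile("markets.json", JSON.stringify({ nysod, usdaByZip }, null, 2));
}

build().catch(console.error);
```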

So far the best suggestion is pretty close to my initial idea, but I’m hoping to get some more feedback on this before I commit myself. Chime in!

November 11, 2013 | Categories: Posts
